Supervised Learning : Leveraging Ensemble Learning With Bagging, Boosting, Stacking and Blending Approaches¶


John Pauline Pineda

March 9, 2025


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
      • 1.4.1 Data Splitting
      • 1.4.2 Data Profiling
      • 1.4.3 Category Aggregation and Encoding
      • 1.4.4 Outlier and Distributional Shape Analysis
      • 1.4.5 Collinearity
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Premodelling Data Preparation
      • 1.6.1 Preprocessed Data Description
      • 1.6.2 Preprocessing Pipeline Development
    • 1.7 Bagged Model Development
      • 1.7.1 Random Forest
      • 1.7.2 Extra Trees
      • 1.7.3 Bagged Decision Tree
      • 1.7.4 Bagged Logistic Regression
      • 1.7.5 Bagged Support Vector Machine
    • 1.8 Boosted Model Development
      • 1.8.1 AdaBoost
      • 1.8.2 Gradient Boosting
      • 1.8.3 XGBoost
      • 1.8.4 Light GBM
      • 1.8.5 CatBoost
    • 1.9 Stacked Model Development
      • 1.9.1 Base Learner - K-Nearest Neighbors
      • 1.9.2 Base Learner - Support Vector Machine
      • 1.9.3 Base Learner - Ridge Classifier
      • 1.9.4 Base Learner - Neural Network
      • 1.9.5 Base Learner - Decision Tree
      • 1.9.6 Meta Learner - Logistic Regression
    • 1.10 Blended Model Development
      • 1.10.1 Base Learner - K-Nearest Neighbors
      • 1.10.2 Base Learner - Support Vector Machine
      • 1.10.3 Base Learner - Ridge Classifier
      • 1.10.4 Base Learner - Neural Network
      • 1.10.5 Base Learner - Decision Tree
      • 1.10.6 Meta Learner - Logistic Regression
    • 1.11 Consolidated Findings
  • 2. Summary
  • 3. References

1. Table of Contents ¶

This project explores different Ensemble Learning approaches which combine the predictions from multiple models in an effort to achieve better predictive performance using various helpful packages in Python. The ensemble frameworks applied in the analysis were grouped into three classes including the Bagging Approach which fits many individual learners on different samples of the same dataset and averages the predictions; Boosting Approach which adds ensemble members sequentially that correct the predictions made by prior models and outputs a weighted average of the predictions; and Stacking or Blending Approach which consolidates many different and diverse learners on the same data and uses another model to learn how to best combine the predictions. Bagged models applied were the Random Forest, Extra Trees, Bagged Decision Tree, Bagged Logistic Regression and Bagged Support Vector Machine algorithms. Boosting models included the AdaBoost, Stochastic Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting Machines and CatBoost algorithms. Individual base learners including the K-Nearest Neighbors, Support Vector Machine, Ridge Classifier, Neural Network and Decision Tree algorithms were stacked or blended together as contributors to the Logistic Regression meta-model. The resulting predictions derived from all ensemble learning models were independtly evaluated on a test set based on accuracy and F1 score metrics. All results were consolidated in a Summary presented at the end of the document.

Ensemble Learning is a machine learning technique that improves predictive accuracy by combining multiple models to leverage their collective strengths. Traditional machine learning models often struggle with either high bias, which leads to overly simplistic predictions, or high variance, which makes them too sensitive to fluctuations in the data. Ensemble learning addresses these challenges by aggregating the outputs of several models, creating a more robust and reliable predictor. In classification problems, this can be done through majority voting, weighted averaging, or more advanced meta-learning techniques. The key advantage of ensemble learning is its ability to reduce both bias and variance, leading to better generalization on unseen data. However, this comes at the cost of increased computational complexity and interpretability, as managing multiple models requires more resources and makes it harder to explain predictions.

Bagging, or Bootstrap Aggregating, is an ensemble learning technique that reduces model variance by training multiple instances of the same algorithm on different randomly sampled subsets of the training data. The fundamental problem bagging aims to solve is overfitting, particularly in high-variance models. By generating multiple bootstrap samples—random subsets created through sampling with replacement — bagging ensures that each model is trained on slightly different data, making the overall prediction more stable. In classification problems, the final output is obtained by majority voting among the individual models, while in regression, their predictions are averaged. Bagging is particularly effective when dealing with noisy datasets, as it smooths out individual model errors. However, its effectiveness is limited for low-variance models, and the requirement to train multiple models increases computational cost.

Boosting is an ensemble learning method that builds a strong classifier by training models sequentially, where each new model focuses on correcting the mistakes of its predecessors. Boosting assigns higher weights to misclassified instances, ensuring that subsequent models pay more attention to these hard-to-classify cases. The motivation behind boosting is to reduce both bias and variance by iteratively refining weak learners — models that perform only slightly better than random guessing — until they collectively form a strong classifier. In classification tasks, predictions are refined by combining weighted outputs of multiple weak models, typically decision stumps or shallow trees. This makes boosting highly effective in uncovering complex patterns in data. However, the sequential nature of boosting makes it computationally expensive compared to bagging, and it is more prone to overfitting if the number of weak learners is too high.

Stacking, or stacked generalization, is an advanced ensemble method that improves predictive performance by training a meta-model to learn the optimal way to combine multiple base models using their out-of-fold predictions. Unlike traditional ensemble techniques such as bagging and boosting, which aggregate predictions through simple rules like averaging or majority voting, stacking introduces a second-level model that intelligently learns how to integrate diverse base models. The process starts by training multiple classifiers on the training dataset. However, instead of directly using their predictions, stacking employs k-fold cross-validation to generate out-of-fold predictions. Specifically, each base model is trained on a subset of the training data while leaving out a validation fold, and predictions on that unseen fold are recorded. This process is repeated across all folds, ensuring that each instance in the training data receives predictions from models that never saw it during training. These out-of-fold predictions are then used as input features for a meta-model, which learns the best way to combine them into a final decision. The advantage of stacking is that it allows different models to complement each other, capturing diverse aspects of the data that a single model might miss. This often results in superior classification accuracy compared to individual models or simpler ensemble approaches. However, stacking is computationally expensive, requiring multiple training iterations for base models and the additional meta-model. It also demands careful tuning to prevent overfitting, as the meta-model’s complexity can introduce new sources of error. Despite these challenges, stacking remains a powerful technique in applications where maximizing predictive performance is a priority.

Blending is an ensemble technique that enhances classification accuracy by training a meta-model on a holdout validation set, rather than using out-of-fold predictions like stacking. This simplifies implementation while maintaining the benefits of combining multiple base models. The process of blending starts by training base models on the full training dataset. Instead of applying cross-validation to obtain out-of-fold predictions, blending reserves a small portion of the training data as a holdout set. The base models make predictions on this unseen holdout set, and these predictions are then used as input features for a meta-model, which learns how to optimally combine them into a final classification decision. Since the meta-model is trained on predictions from unseen data, it avoids the risk of overfitting that can sometimes occur when base models are evaluated on the same data they were trained on. Blending is motivated by its simplicity and ease of implementation compared to stacking, as it eliminates the need for repeated k-fold cross-validation to generate training data for the meta-model. However, one drawback is that the meta-model has access to fewer training examples, as a portion of the data is withheld for validation rather than being used for training. This can limit the generalization ability of the final model, especially if the holdout set is too small. Despite this limitation, blending remains a useful approach in applications where a quick and effective ensemble method is needed without the computational overhead of stacking.

1.1. Data Background ¶

An open Thyroid Disease Dataset from Kaggle (with all credits attributed to Jai Naru and Abuchi Onwuegbusi) was used for the analysis as consolidated from the following primary sources:

  1. Reference Repository entitled Differentiated Thyroid Cancer Recurrence from UC Irvine Machine Learning Repository
  2. Research Paper entitled Machine Learning for Risk Stratification of Thyroid Cancer Patients: a 15-year Cohort Study from the European Archives of Oto-Rhino-Laryngology

This study hypothesized that the various clinicopathological characteristics influence differentiated thyroid cancer recurrence between patients.

The dichotomous categorical variable for the study is:

  • Recurred - Status of the patient (Yes, Recurrence of differentiated thyroid cancer | No, No recurrence of differentiated thyroid cancer)

The predictor variables for the study are:

  • Age - Patient's age (Years)
  • Gender - Patient's sex (M | F)
  • Smoking - Indication of smoking (Yes | No)
  • Hx Smoking - Indication of smoking history (Yes | No)
  • Hx Radiotherapy - Indication of radiotherapy history for any condition (Yes | No)
  • Thyroid Function - Status of thyroid function (Clinical Hyperthyroidism, Hypothyroidism | Subclinical Hyperthyroidism, Hypothyroidism | Euthyroid)
  • Physical Examination - Findings from physical examination including palpation of the thyroid gland and surrounding structures (Normal | Diffuse Goiter | Multinodular Goiter | Single Nodular Goiter Left, Right)
  • Adenopathy - Indication of enlarged lymph nodes in the neck region (No | Right | Extensive | Left | Bilateral | Posterior)
  • Pathology - Specific thyroid cancer type as determined by pathology examination of biopsy samples (Follicular | Hurthel Cell | Micropapillary | Papillary)
  • Focality - Indication if the cancer is limited to one location or present in multiple locations (Uni-Focal | Multi-Focal)
  • Risk - Risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type (Low | Intermediate | High)
  • T - Tumor classification based on its size and extent of invasion into nearby structures (T1a | T1b | T2 | T3a | T3b | T4a | T4b)
  • N - Nodal classification indicating the involvement of lymph nodes (N0 | N1a | N1b)
  • M - Metastasis classification indicating the presence or absence of distant metastases (M0 | M1)
  • Stage - Overall stage of the cancer, typically determined by combining T, N, and M classifications (I | II | III | IVa | IVb)
  • Response - Cancer's response to treatment (Biochemical Incomplete | Indeterminate | Excellent | Structural Incomplete)

1.2. Data Description ¶

  1. The initial tabular dataset was comprised of 383 observations and 17 variables (including 1 target and 16 predictors).
    • 383 rows (observations)
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 16/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
In [1]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import itertools
import os
import pickle
%matplotlib inline

from operator import add,mul,truediv
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold, KFold, cross_val_score
In [2]:
##################################
# Defining file paths
##################################
DATASETS_ORIGINAL_PATH = r"datasets\original"
DATASETS_FINAL_PATH = r"datasets\final\complete"
DATASETS_FINAL_TRAIN_PATH = r"datasets\final\train"
DATASETS_FINAL_TRAIN_FEATURES_PATH = r"datasets\final\train\features"
DATASETS_FINAL_TRAIN_TARGET_PATH = r"datasets\final\train\target"
DATASETS_FINAL_VALIDATION_PATH = r"datasets\final\validation"
DATASETS_FINAL_VALIDATION_FEATURES_PATH = r"datasets\final\validation\features"
DATASETS_FINAL_VALIDATION_TARGET_PATH = r"datasets\final\validation\target"
DATASETS_FINAL_TEST_PATH = r"datasets\final\test"
DATASETS_FINAL_TEST_FEATURES_PATH = r"datasets\final\test\features"
DATASETS_FINAL_TEST_TARGET_PATH = r"datasets\final\test\target"
DATASETS_PREPROCESSED_PATH = r"datasets\preprocessed"
DATASETS_PREPROCESSED_TRAIN_PATH = r"datasets\preprocessed\train"
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = r"datasets\preprocessed\train\features"
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = r"datasets\preprocessed\train\target"
DATASETS_PREPROCESSED_VALIDATION_PATH = r"datasets\preprocessed\validation"
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = r"datasets\preprocessed\validation\features"
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = r"datasets\preprocessed\validation\target"
DATASETS_PREPROCESSED_TEST_PATH = r"datasets\preprocessed\test"
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = r"datasets\preprocessed\test\features"
DATASETS_PREPROCESSED_TEST_TARGET_PATH = r"datasets\preprocessed\test\target"
MODELS_PATH = r"models"
In [3]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
thyroid_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Thyroid_Diff.csv"))
In [4]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(thyroid_cancer.shape)
Dataset Dimensions: 
(383, 17)
In [5]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(thyroid_cancer.dtypes)
Column Names and Data Types:
Age                      int64
Gender                  object
Smoking                 object
Hx Smoking              object
Hx Radiotherapy         object
Thyroid Function        object
Physical Examination    object
Adenopathy              object
Pathology               object
Focality                object
Risk                    object
T                       object
N                       object
M                       object
Stage                   object
Response                object
Recurred                object
dtype: object
In [6]:
##################################
# Renaming and standardizing the column names
# to replace blanks with undercores
##################################
thyroid_cancer.columns = thyroid_cancer.columns.str.replace(" ", "_")
In [7]:
##################################
# Taking a snapshot of the dataset
##################################
thyroid_cancer.head()
Out[7]:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
0 27 F No No No Euthyroid Single nodular goiter-left No Micropapillary Uni-Focal Low T1a N0 M0 I Indeterminate No
1 34 F No Yes No Euthyroid Multinodular goiter No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
2 30 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
3 62 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
4 62 F No No No Euthyroid Multinodular goiter No Micropapillary Multi-Focal Low T1a N0 M0 I Excellent No
In [8]:
##################################
# Selecting categorical columns (both object and categorical types)
# and listing the unique categorical levels
##################################
cat_cols = thyroid_cancer.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    print(f"Categorical | Object Column: {col}")
    print(thyroid_cancer[col].unique())  
    print("-" * 40)
    
Categorical | Object Column: Gender
['F' 'M']
----------------------------------------
Categorical | Object Column: Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Smoking
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Hx_Radiotherapy
['No' 'Yes']
----------------------------------------
Categorical | Object Column: Thyroid_Function
['Euthyroid' 'Clinical Hyperthyroidism' 'Clinical Hypothyroidism'
 'Subclinical Hyperthyroidism' 'Subclinical Hypothyroidism']
----------------------------------------
Categorical | Object Column: Physical_Examination
['Single nodular goiter-left' 'Multinodular goiter'
 'Single nodular goiter-right' 'Normal' 'Diffuse goiter']
----------------------------------------
Categorical | Object Column: Adenopathy
['No' 'Right' 'Extensive' 'Left' 'Bilateral' 'Posterior']
----------------------------------------
Categorical | Object Column: Pathology
['Micropapillary' 'Papillary' 'Follicular' 'Hurthel cell']
----------------------------------------
Categorical | Object Column: Focality
['Uni-Focal' 'Multi-Focal']
----------------------------------------
Categorical | Object Column: Risk
['Low' 'Intermediate' 'High']
----------------------------------------
Categorical | Object Column: T
['T1a' 'T1b' 'T2' 'T3a' 'T3b' 'T4a' 'T4b']
----------------------------------------
Categorical | Object Column: N
['N0' 'N1b' 'N1a']
----------------------------------------
Categorical | Object Column: M
['M0' 'M1']
----------------------------------------
Categorical | Object Column: Stage
['I' 'II' 'IVB' 'III' 'IVA']
----------------------------------------
Categorical | Object Column: Response
['Indeterminate' 'Excellent' 'Structural Incomplete'
 'Biochemical Incomplete']
----------------------------------------
Categorical | Object Column: Recurred
['No' 'Yes']
----------------------------------------
In [9]:
##################################
# Correcting a category level
##################################
thyroid_cancer["Pathology"] = thyroid_cancer["Pathology"].replace("Hurthel cell", "Hurthle Cell")
In [10]:
##################################
# Setting the levels of the categorical variables
##################################
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].astype('category')
thyroid_cancer['Recurred'] = thyroid_cancer['Recurred'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].astype('category')
thyroid_cancer['Gender'] = thyroid_cancer['Gender'].cat.set_categories(['M', 'F'], ordered=True)
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].astype('category')
thyroid_cancer['Smoking'] = thyroid_cancer['Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].astype('category')
thyroid_cancer['Hx_Smoking'] = thyroid_cancer['Hx_Smoking'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].astype('category')
thyroid_cancer['Hx_Radiotherapy'] = thyroid_cancer['Hx_Radiotherapy'].cat.set_categories(['No', 'Yes'], ordered=True)
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].astype('category')
thyroid_cancer['Thyroid_Function'] = thyroid_cancer['Thyroid_Function'].cat.set_categories(['Euthyroid', 'Subclinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Clinical Hyperthyroidism'], ordered=True)
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].astype('category')
thyroid_cancer['Physical_Examination'] = thyroid_cancer['Physical_Examination'].cat.set_categories(['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right', 'Multinodular goiter', 'Diffuse goiter'], ordered=True)
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].astype('category')
thyroid_cancer['Adenopathy'] = thyroid_cancer['Adenopathy'].cat.set_categories(['No', 'Left', 'Right', 'Bilateral', 'Posterior', 'Extensive'], ordered=True)
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].astype('category')
thyroid_cancer['Pathology'] = thyroid_cancer['Pathology'].cat.set_categories(['Hurthle Cell', 'Follicular', 'Micropapillary', 'Papillary'], ordered=True)
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].astype('category')
thyroid_cancer['Focality'] = thyroid_cancer['Focality'].cat.set_categories(['Uni-Focal', 'Multi-Focal'], ordered=True)
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].astype('category')
thyroid_cancer['Risk'] = thyroid_cancer['Risk'].cat.set_categories(['Low', 'Intermediate', 'High'], ordered=True)
thyroid_cancer['T'] = thyroid_cancer['T'].astype('category')
thyroid_cancer['T'] = thyroid_cancer['T'].cat.set_categories(['T1a', 'T1b', 'T2', 'T3a', 'T3b', 'T4a', 'T4b'], ordered=True)
thyroid_cancer['N'] = thyroid_cancer['N'].astype('category')
thyroid_cancer['N'] = thyroid_cancer['N'].cat.set_categories(['N0', 'N1a', 'N1b'], ordered=True)
thyroid_cancer['M'] = thyroid_cancer['M'].astype('category')
thyroid_cancer['M'] = thyroid_cancer['M'].cat.set_categories(['M0', 'M1'], ordered=True)
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].astype('category')
thyroid_cancer['Stage'] = thyroid_cancer['Stage'].cat.set_categories(['I', 'II', 'III', 'IVA', 'IVB'], ordered=True)
thyroid_cancer['Response'] = thyroid_cancer['Response'].astype('category')
thyroid_cancer['Response'] = thyroid_cancer['Response'].cat.set_categories(['Excellent', 'Structural Incomplete', 'Biochemical Incomplete', 'Indeterminate'], ordered=True)
In [11]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(thyroid_cancer.describe(include='number').transpose())
Numeric Variable Summary:
count mean std min 25% 50% 75% max
Age 383.0 40.866841 15.134494 15.0 29.0 37.0 51.0 82.0
In [12]:
##################################
# Performing a general exploration of the categorical variables
##################################
print('Categorical Variable Summary:')
display(thyroid_cancer.describe(include='category').transpose())
Categorical Variable Summary:
count unique top freq
Gender 383 2 F 312
Smoking 383 2 No 334
Hx_Smoking 383 2 No 355
Hx_Radiotherapy 383 2 No 376
Thyroid_Function 383 5 Euthyroid 332
Physical_Examination 383 5 Single nodular goiter-right 140
Adenopathy 383 6 No 277
Pathology 383 4 Papillary 287
Focality 383 2 Uni-Focal 247
Risk 383 3 Low 249
T 383 7 T2 151
N 383 3 N0 268
M 383 2 M0 365
Stage 383 5 I 333
Response 383 4 Excellent 208
Recurred 383 2 No 275
In [13]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer[col].value_counts().reindex(thyroid_cancer[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer[col].value_counts(normalize=True).reindex(thyroid_cancer[col].cat.categories))
    print("-" * 50)
   
Column: Gender
Absolute Frequencies:
M     71
F    312
Name: count, dtype: int64

Normalized Frequencies:
M    0.185379
F    0.814621
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     334
Yes     49
Name: count, dtype: int64

Normalized Frequencies:
No     0.872063
Yes    0.127937
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     355
Yes     28
Name: count, dtype: int64

Normalized Frequencies:
No     0.926893
Yes    0.073107
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     376
Yes      7
Name: count, dtype: int64

Normalized Frequencies:
No     0.981723
Yes    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      332
Subclinical Hypothyroidism      14
Subclinical Hyperthyroidism      5
Clinical Hypothyroidism         12
Clinical Hyperthyroidism        20
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.866841
Subclinical Hypothyroidism     0.036554
Subclinical Hyperthyroidism    0.013055
Clinical Hypothyroidism        0.031332
Clinical Hyperthyroidism       0.052219
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                           7
Single nodular goiter-left      89
Single nodular goiter-right    140
Multinodular goiter            140
Diffuse goiter                   7
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.018277
Single nodular goiter-left     0.232376
Single nodular goiter-right    0.365535
Multinodular goiter            0.365535
Diffuse goiter                 0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           277
Left          17
Right         48
Bilateral     32
Posterior      2
Extensive      7
Name: count, dtype: int64

Normalized Frequencies:
No           0.723238
Left         0.044386
Right        0.125326
Bilateral    0.083551
Posterior    0.005222
Extensive    0.018277
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       20
Follicular         28
Micropapillary     48
Papillary         287
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.052219
Follicular        0.073107
Micropapillary    0.125326
Papillary         0.749347
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      247
Multi-Focal    136
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.644909
Multi-Focal    0.355091
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             249
Intermediate    102
High             32
Name: count, dtype: int64

Normalized Frequencies:
Low             0.650131
Intermediate    0.266319
High            0.083551
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a     49
T1b     43
T2     151
T3a     96
T3b     16
T4a     20
T4b      8
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.127937
T1b    0.112272
T2     0.394256
T3a    0.250653
T3b    0.041775
T4a    0.052219
T4b    0.020888
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     268
N1a     22
N1b     93
Name: count, dtype: int64

Normalized Frequencies:
N0     0.699739
N1a    0.057441
N1b    0.242820
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    365
M1     18
Name: count, dtype: int64

Normalized Frequencies:
M0    0.953003
M1    0.046997
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      333
II      32
III      4
IVA      3
IVB     11
Name: count, dtype: int64

Normalized Frequencies:
I      0.869452
II     0.083551
III    0.010444
IVA    0.007833
IVB    0.028721
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 208
Structural Incomplete      91
Biochemical Incomplete     23
Indeterminate              61
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.543081
Structural Incomplete     0.237598
Biochemical Incomplete    0.060052
Indeterminate             0.159269
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     275
Yes    108
Name: count, dtype: int64

Normalized Frequencies:
No     0.718016
Yes    0.281984
Name: proportion, dtype: float64
--------------------------------------------------

1.3. Data Quality Assessment ¶

Data quality findings based on assessment are as follows:

  1. A total of 19 duplicated rows were identified.
    • In total, 34 observations were affected, consisting of 16 unique occurrences and 19 subsequent duplicates.
    • These 19 duplicates spanned 16 distinct variations, meaning some variations had multiple duplicates.
    • To clean the dataset, all 19 duplicate rows were removed, retaining only the first occurrence of each of the 16 unique variations.
  2. No missing data noted for any variable with Null.Count>0 and Fill.Rate<1.0.
  3. Low variance observed for 8 variables with First.Second.Mode.Ratio>5.
    • Hx_Radiotherapy: First.Second.Mode.Ratio = 51.000 (comprised 2 category levels)
    • M: First.Second.Mode.Ratio = 19.222 (comprised 2 category levels)
    • Thyroid_Function: First.Second.Mode.Ratio = 15.650 (comprised 5 category levels)
    • Hx_Smoking: First.Second.Mode.Ratio = 12.000 (comprised 2 category levels)
    • Stage: First.Second.Mode.Ratio = 9.812 (comprised 5 category levels)
    • Smoking: First.Second.Mode.Ratio = 6.428 (comprised 2 category levels)
    • Pathology: First.Second.Mode.Ratio = 6.022 (comprised 4 category levels)
    • Adenopathy: First.Second.Mode.Ratio = 5.375 (comprised 5 category levels)
  4. No low variance observed for any variable with Unique.Count.Ratio>10.
  5. No high skewness observed for any variable with Skewness>3 or Skewness<(-3).
In [14]:
##################################
# Counting the number of duplicated rows
##################################
thyroid_cancer.duplicated().sum()
Out[14]:
19
In [15]:
##################################
# Exploring the duplicated rows
##################################
duplicated_rows = thyroid_cancer[thyroid_cancer.duplicated(keep=False)]
display(duplicated_rows)
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred
8 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
9 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
22 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
32 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
38 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
40 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No
61 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
66 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
67 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
69 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
73 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
77 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No
106 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
110 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
113 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
115 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
119 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
120 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
121 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
123 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
132 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
136 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
137 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
138 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
142 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
161 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
166 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
168 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
170 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
175 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
178 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
183 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
187 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
189 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
196 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No
In [16]:
##################################
# Checking if duplicated rows have identical values across all columns
##################################
num_unique_dup_rows = duplicated_rows.drop_duplicates().shape[0]
num_total_dup_rows = duplicated_rows.shape[0]
if num_unique_dup_rows == 1:
    print("All duplicated rows have the same values across all columns.")
else:
    print(f"There are {num_unique_dup_rows} unique versions among the {num_total_dup_rows} duplicated rows.")
    
There are 16 unique versions among the 35 duplicated rows.
In [17]:
##################################
# Counting the unique variations among duplicated rows
##################################
unique_dup_variations = duplicated_rows.drop_duplicates()
variation_counts = duplicated_rows.value_counts().reset_index(name="Count")
print("Unique duplicated row variations and their counts:")
display(variation_counts)
Unique duplicated row variations and their counts:
Age Gender Smoking Hx_Smoking Hx_Radiotherapy Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N M Stage Response Recurred Count
0 26 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 4
1 32 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 3
2 21 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
3 22 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
4 28 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
5 29 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
6 31 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
7 34 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
8 35 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
9 36 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
10 37 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
11 38 F No No No Euthyroid Single nodular goiter-right No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
12 40 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
13 42 F No No No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 M0 I Excellent No 2
14 51 F No No No Euthyroid Single nodular goiter-left No Papillary Uni-Focal Low T1b N0 M0 I Excellent No 2
15 51 F No No No Euthyroid Single nodular goiter-right No Micropapillary Uni-Focal Low T1a N0 M0 I Excellent No 2
In [18]:
##################################
# Removing the duplicated rows and
# retaining only the first occurrence
##################################
thyroid_cancer_row_filtered = thyroid_cancer.drop_duplicates(keep="first")
print('Dataset Dimensions: ')
display(thyroid_cancer_row_filtered.shape)
Dataset Dimensions: 
(364, 17)
In [19]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(thyroid_cancer_row_filtered.dtypes)
In [20]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(thyroid_cancer_row_filtered.columns)
In [21]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(thyroid_cancer_row_filtered)] * len(thyroid_cancer_row_filtered.columns))
In [22]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(thyroid_cancer_row_filtered.isna().sum(axis=0))
In [23]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(thyroid_cancer_row_filtered.count())
In [24]:
##################################
# Gathering the missing data percentage for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
In [25]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 Age int64 364 364 0 1.0
1 Gender category 364 364 0 1.0
2 Smoking category 364 364 0 1.0
3 Hx_Smoking category 364 364 0 1.0
4 Hx_Radiotherapy category 364 364 0 1.0
5 Thyroid_Function category 364 364 0 1.0
6 Physical_Examination category 364 364 0 1.0
7 Adenopathy category 364 364 0 1.0
8 Pathology category 364 364 0 1.0
9 Focality category 364 364 0 1.0
10 Risk category 364 364 0 1.0
11 T category 364 364 0 1.0
12 N category 364 364 0 1.0
13 M category 364 364 0 1.0
14 Stage category 364 364 0 1.0
15 Response category 364 364 0 1.0
16 Recurred category 364 364 0 1.0
In [26]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[26]:
0
In [27]:
##################################
# Identifying the rows
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [28]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = thyroid_cancer_row_filtered.index
In [29]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(thyroid_cancer_row_filtered.columns)] * len(thyroid_cancer_row_filtered))
In [30]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(thyroid_cancer_row_filtered.isna().sum(axis=1))
In [31]:
##################################
# Gathering the missing data percentage for each column
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [32]:
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 0 17 0 0.0
1 1 17 0 0.0
2 2 17 0 0.0
3 3 17 0 0.0
4 4 17 0 0.0
... ... ... ... ...
359 378 17 0 0.0
360 379 17 0 0.0
361 380 17 0 0.0
362 381 17 0 0.0
363 382 17 0 0.0

364 rows × 4 columns

In [33]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[33]:
0
In [34]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
thyroid_cancer_numeric = thyroid_cancer_row_filtered.select_dtypes(include='number')
In [35]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_numeric.columns
In [36]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = thyroid_cancer_numeric.min()
In [37]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = thyroid_cancer_numeric.mean()
In [38]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = thyroid_cancer_numeric.median()
In [39]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = thyroid_cancer_numeric.max()
In [40]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0] for x in thyroid_cancer_numeric]
In [41]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1] for x in thyroid_cancer_numeric]
In [42]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_numeric]
In [43]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [thyroid_cancer_numeric[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_numeric]
In [44]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [45]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = thyroid_cancer_numeric.nunique(dropna=True)
In [46]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(thyroid_cancer_numeric)] * len(thyroid_cancer_numeric.columns))
In [47]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [48]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_numeric.skew()
In [49]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_numeric.kurtosis()
In [50]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 Age 15 41.25 38.0 82 31 27 21 13 1.615385 65 364 0.178571 0.678269 -0.359255
In [51]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[51]:
0
In [52]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[52]:
0
In [53]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[53]:
0
In [54]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
thyroid_cancer_categorical = thyroid_cancer_row_filtered.select_dtypes(include='category')
In [55]:
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = thyroid_cancer_categorical.columns
In [56]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[0] for x in thyroid_cancer_categorical]
In [57]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [thyroid_cancer_row_filtered[x].value_counts().index.tolist()[1] for x in thyroid_cancer_categorical]
In [58]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in thyroid_cancer_categorical]
In [59]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [thyroid_cancer_categorical[x].isin([thyroid_cancer_row_filtered[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in thyroid_cancer_categorical]
In [60]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [61]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = thyroid_cancer_categorical.nunique(dropna=True)
In [62]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(thyroid_cancer_categorical)] * len(thyroid_cancer_categorical.columns))
In [63]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [64]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
0 Gender F M 293 71 4.126761 2 364 0.005495
1 Smoking No Yes 315 49 6.428571 2 364 0.005495
2 Hx_Smoking No Yes 336 28 12.000000 2 364 0.005495
3 Hx_Radiotherapy No Yes 357 7 51.000000 2 364 0.005495
4 Thyroid_Function Euthyroid Clinical Hyperthyroidism 313 20 15.650000 5 364 0.013736
5 Physical_Examination Multinodular goiter Single nodular goiter-right 135 127 1.062992 5 364 0.013736
6 Adenopathy No Right 258 48 5.375000 6 364 0.016484
7 Pathology Papillary Micropapillary 271 45 6.022222 4 364 0.010989
8 Focality Uni-Focal Multi-Focal 228 136 1.676471 2 364 0.005495
9 Risk Low Intermediate 230 102 2.254902 3 364 0.008242
10 T T2 T3a 138 96 1.437500 7 364 0.019231
11 N N0 N1b 249 93 2.677419 3 364 0.008242
12 M M0 M1 346 18 19.222222 2 364 0.005495
13 Stage I II 314 32 9.812500 5 364 0.013736
14 Response Excellent Structural Incomplete 189 91 2.076923 4 364 0.010989
15 Recurred No Yes 256 108 2.370370 2 364 0.005495
In [65]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[65]:
8
In [66]:
##################################
# Identifying the categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
display(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
3 Hx_Radiotherapy No Yes 357 7 51.000000 2 364 0.005495
12 M M0 M1 346 18 19.222222 2 364 0.005495
4 Thyroid_Function Euthyroid Clinical Hyperthyroidism 313 20 15.650000 5 364 0.013736
2 Hx_Smoking No Yes 336 28 12.000000 2 364 0.005495
13 Stage I II 314 32 9.812500 5 364 0.013736
1 Smoking No Yes 315 49 6.428571 2 364 0.005495
7 Pathology Papillary Micropapillary 271 45 6.022222 4 364 0.010989
6 Adenopathy No Right 258 48 5.375000 6 364 0.016484
In [67]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[67]:
0

1.4. Data Preprocessing ¶

1.4.1 Data Splitting ¶

  1. The baseline dataset (with duplicate rows removed from the original dataset) is comprised of:
    • 364 rows (observations)
      • 256 Recurred=No: 70.33%
      • 108 Recurred=Yes: 29.67%
    • 17 columns (variables)
      • 1/17 target (categorical)
        • Recurred
      • 1/17 predictor (numeric)
        • Age
      • 15/17 predictor (categorical)
        • Gender
        • Smoking
        • Hx_Smoking
        • Hx_Radiotherapy
        • Thyroid_Function
        • Physical_Examination
        • Adenopathy
        • Pathology
        • Focality
        • Risk
        • T
        • N
        • M
        • Stage
        • Response
  2. The baseline dataset was divided into three subsets using a fixed random seed:
    • test data: 25% of the original data with class stratification applied
    • train data (initial): 75% of the original data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  3. Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters were selected, based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model were selected based on performance on the validation data.
  5. Performance of the selected final model (and other candidate models for post-model selection comparison) were evaluated using the test data.
  6. The train data (final) subset is comprised of:
    • 204 rows (observations)
      • 143 Recurred=No: 70.10%
      • 61 Recurred=Yes: 29.90%
    • 17 columns (variables)
  7. The validation data subset is comprised of:
    • 69 rows (observations)
      • 49 Recurred=No: 71.01%
      • 20 Recurred=Yes: 28.98%
    • 17 columns (variables)
  8. The test data subset is comprised of:
    • 91 rows (observations)
      • 64 Recurred=No: 70.33%
      • 27 Recurred=Yes: 29.67%
    • 17 columns (variables)
In [68]:
##################################
# Creating a dataset copy
# of the row filtered data
##################################
thyroid_cancer_baseline = thyroid_cancer_row_filtered.copy()
In [69]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(thyroid_cancer_baseline.shape)
Final Dataset Dimensions: 
(364, 17)
In [70]:
print('Target Variable Breakdown: ')
thyroid_cancer_breakdown = thyroid_cancer_baseline.groupby('Recurred', observed=True).size().reset_index(name='Count')
thyroid_cancer_breakdown['Percentage'] = (thyroid_cancer_breakdown['Count'] / len(thyroid_cancer_baseline)) * 100
display(thyroid_cancer_breakdown)
Target Variable Breakdown: 
Recurred Count Percentage
0 No 256 70.32967
1 Yes 108 29.67033
In [71]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train_initial, thyroid_cancer_test = train_test_split(thyroid_cancer_baseline, 
                                                               test_size=0.25, 
                                                               stratify=thyroid_cancer_baseline['Recurred'], 
                                                               random_state=88888888)
In [72]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = thyroid_cancer_train_initial.drop('Recurred', axis = 1)
y_train_initial = thyroid_cancer_train_initial['Recurred']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions: 
(273, 16)
(273,)
Initial Train Target Variable Breakdown: 
Recurred
No     192
Yes     81
Name: count, dtype: int64
Initial Train Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [73]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = thyroid_cancer_test.drop('Recurred', axis = 1)
y_test = thyroid_cancer_test['Recurred']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(91, 16)
(91,)
Test Target Variable Breakdown: 
Recurred
No     64
Yes    27
Name: count, dtype: int64
Test Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
In [74]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
thyroid_cancer_train, thyroid_cancer_validation = train_test_split(thyroid_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=thyroid_cancer_train_initial['Recurred'], 
                                                             random_state=88888888)
In [75]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = thyroid_cancer_train.drop('Recurred', axis = 1)
y_train = thyroid_cancer_train['Recurred']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions: 
(204, 16)
(204,)
Final Train Target Variable Breakdown: 
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Train Target Variable Proportion: 
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
In [76]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = thyroid_cancer_validation.drop('Recurred', axis = 1)
y_validation = thyroid_cancer_validation['Recurred']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(69, 16)
(69,)
Validation Target Variable Breakdown: 
Recurred
No     49
Yes    20
Name: count, dtype: int64
Validation Target Variable Proportion: 
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
In [77]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
thyroid_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "thyroid_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [78]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURE_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
thyroid_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "thyroid_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [79]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
thyroid_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "thyroid_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)

1.4.2 Data Profiling ¶

  1. No distributional anomalies were obseerved for the numeric predictor Age.
  2. 9 categorical predictors were observed with categories consisting of too few cases that risk poor generalization and cross-validation issues:
    • Thyroid_Function:
      • 176 Thyroid_Function=Euthyroid: 86.27%
      • 8 Thyroid_Function=Subclinical Hypothyroidism: 3.92%
      • 3 Thyroid_Function=Subclinical Hyperthyroidism: 1.47%
      • 6 Thyroid_Function=Clinical Hypothyroidism: 2.94%
      • 11 Thyroid_Function=Clinical Hyperthyroidism: 5.39%
    • Physical_Examination:
      • 5 Physical_Examination=Normal: 2.45%
      • 47 Physical_Examination=Single nodular goiter-left: 23.04%
      • 69 Physical_Examination=Single nodular goiter-right: 33.82%
      • 78 Physical_Examination=Multinodular goiter: 38.24%
      • 5 Physical_Examination=Diffuse goiter: 2.45%
    • Adenopathy:
      • 143 Adenopathy=No: 70.09%
      • 8 Adenopathy=Left: 3.92%
      • 26 Adenopathy=Right: 12.75%
      • 1 Adenopathy=Posterior: 0.49%
      • 22 Adenopathy=Bilateral: 10.78%
      • 4 Adenopathy=Extensive: 1.96%
    • Pathology:
      • 11 Pathology=Hurthle Cell: 5.39%
      • 14 Pathology=Follicular: 6.86%
      • 28 Pathology=Micropapillary: 13.73%
      • 151 Pathology=Papillary: 74.02%
    • Risk:
      • 131 Risk=Low: 64.22%
      • 55 Risk=Intermediate: 26.96%
      • 18 Risk=High: 8.82%
    • T:
      • 27 T=T1a: 13.23%
      • 20 T=T1b: 9.80%
      • 79 T=T2: 38.72%
      • 55 T=T3a: 26.96%
      • 6 T=T3b: 2.94%
      • 11 T=T4a: 5.39%
      • 6 T=T4b: 2.94%
    • N:
      • 140 N=N0: 68.63%
      • 13 N=N1a: 6.37%
      • 51 N=N1b: 25.00%
    • Stage:
      • 178 Stage=I: 87.25%
      • 16 Stage=II: 7.84%
      • 1 Stage=III: 0.49%
      • 3 Stage=IVA: 1.47%
      • 6 Stage=IVB: 2.94%
    • Response:
      • 107 Response=Excellent: 52.45%
      • 54 Response=Structural Incomplete: 26.47%
      • 17 Response=Biochemical Incomplete: 8.33%
      • 26 Response=Indeterminate: 12.75%
  3. 3 categorical predictors were excluded from the dataset after having been observed with extremely low variance containing categories with very few or almost no variations across observations that may have limited predictive power or drive increased model complexity without performance gains:
    • Hx_Smoking:
      • 189 Hx_Smoking=No: 92.65%
      • 15 Hx_Smoking=Yes: 7.35%
    • Hx_Radiotherapy:
      • 199 Hx_Radiotherapy=No: 97.55%
      • 15 Hx_Radiotherapy=Yes: 2.45%
    • M:
      • 192 M=M0: 94.12%
      • 12 M=M1: 5.88%
In [80]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train.iloc[:,:-1].loc[:, thyroid_cancer_train.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train.iloc[:,:-1].loc[:,thyroid_cancer_train.iloc[:,:-1].columns != 'Age'].columns
In [81]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_predictors_numeric
In [82]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
histogram_grouping_variable = 'Recurred'
histogram_frequency_variable = numeric_variable_name_list.values[0]
In [83]:
##################################
# Comparing the numeric predictors
# grouped by the target variable
##################################
colors = plt.get_cmap('tab10').colors
plt.figure(figsize=(7, 5))
group_no = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'No'][histogram_frequency_variable]
group_yes = thyroid_cancer_train[thyroid_cancer_train[histogram_grouping_variable] == 'Yes'][histogram_frequency_variable]
plt.hist(group_no, bins=20, alpha=0.5, color=colors[0], label='No', edgecolor='black')
plt.hist(group_yes, bins=20, alpha=0.5, color=colors[1], label='Yes', edgecolor='black')
plt.title(f'{histogram_grouping_variable} Versus {histogram_frequency_variable}')
plt.xlabel(histogram_frequency_variable)
plt.ylabel('Frequency')
plt.legend()
plt.show()
No description has been provided for this image
In [84]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer_train[col].value_counts().reindex(thyroid_cancer_train[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer_train[col].value_counts(normalize=True).reindex(thyroid_cancer_train[col].cat.categories))
    print("-" * 50)
    
Column: Gender
Absolute Frequencies:
M     46
F    158
Name: count, dtype: int64

Normalized Frequencies:
M    0.22549
F    0.77451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     172
Yes     32
Name: count, dtype: int64

Normalized Frequencies:
No     0.843137
Yes    0.156863
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Smoking
Absolute Frequencies:
No     189
Yes     15
Name: count, dtype: int64

Normalized Frequencies:
No     0.926471
Yes    0.073529
Name: proportion, dtype: float64
--------------------------------------------------
Column: Hx_Radiotherapy
Absolute Frequencies:
No     199
Yes      5
Name: count, dtype: int64

Normalized Frequencies:
No     0.97549
Yes    0.02451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                      176
Subclinical Hypothyroidism       8
Subclinical Hyperthyroidism      3
Clinical Hypothyroidism          6
Clinical Hyperthyroidism        11
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                      0.862745
Subclinical Hypothyroidism     0.039216
Subclinical Hyperthyroidism    0.014706
Clinical Hypothyroidism        0.029412
Clinical Hyperthyroidism       0.053922
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Normal                          5
Single nodular goiter-left     47
Single nodular goiter-right    69
Multinodular goiter            78
Diffuse goiter                  5
Name: count, dtype: int64

Normalized Frequencies:
Normal                         0.024510
Single nodular goiter-left     0.230392
Single nodular goiter-right    0.338235
Multinodular goiter            0.382353
Diffuse goiter                 0.024510
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No           143
Left           8
Right         26
Bilateral     22
Posterior      1
Extensive      4
Name: count, dtype: int64

Normalized Frequencies:
No           0.700980
Left         0.039216
Right        0.127451
Bilateral    0.107843
Posterior    0.004902
Extensive    0.019608
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Hurthle Cell       11
Follicular         14
Micropapillary     28
Papillary         151
Name: count, dtype: int64

Normalized Frequencies:
Hurthle Cell      0.053922
Follicular        0.068627
Micropapillary    0.137255
Papillary         0.740196
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      118
Multi-Focal     86
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.578431
Multi-Focal    0.421569
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Low             131
Intermediate     55
High             18
Name: count, dtype: int64

Normalized Frequencies:
Low             0.642157
Intermediate    0.269608
High            0.088235
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1a    27
T1b    20
T2     79
T3a    55
T3b     6
T4a    11
T4b     6
Name: count, dtype: int64

Normalized Frequencies:
T1a    0.132353
T1b    0.098039
T2     0.387255
T3a    0.269608
T3b    0.029412
T4a    0.053922
T4b    0.029412
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0     140
N1a     13
N1b     51
Name: count, dtype: int64

Normalized Frequencies:
N0     0.686275
N1a    0.063725
N1b    0.250000
Name: proportion, dtype: float64
--------------------------------------------------
Column: M
Absolute Frequencies:
M0    192
M1     12
Name: count, dtype: int64

Normalized Frequencies:
M0    0.941176
M1    0.058824
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I      178
II      16
III      1
IVA      3
IVB      6
Name: count, dtype: int64

Normalized Frequencies:
I      0.872549
II     0.078431
III    0.004902
IVA    0.014706
IVB    0.029412
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                 107
Structural Incomplete      54
Biochemical Incomplete     17
Indeterminate              26
Name: count, dtype: int64

Normalized Frequencies:
Excellent                 0.524510
Structural Incomplete     0.264706
Biochemical Incomplete    0.083333
Indeterminate             0.127451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     143
Yes     61
Name: count, dtype: int64

Normalized Frequencies:
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
--------------------------------------------------
In [85]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
In [86]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 5
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 25))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
No description has been provided for this image
In [87]:
##################################
# Removing predictors observed with extreme
# near-zero variance and a limited number of levels
##################################
thyroid_cancer_train_column_filtered = thyroid_cancer_train.drop(columns=['Hx_Radiotherapy','M','Hx_Smoking'])
thyroid_cancer_train_column_filtered.head()
Out[87]:
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response Recurred
335 29 M No Euthyroid Multinodular goiter Extensive Papillary Multi-Focal Intermediate T3a N1b I Structural Incomplete Yes
201 25 F No Euthyroid Single nodular goiter-right Right Papillary Multi-Focal Low T2 N1b I Excellent No
134 51 F No Euthyroid Multinodular goiter No Papillary Uni-Focal Low T2 N0 I Excellent No
35 37 F No Subclinical Hypothyroidism Single nodular goiter-left No Micropapillary Uni-Focal Low T1a N0 I Excellent No
380 72 M Yes Euthyroid Multinodular goiter Bilateral Papillary Multi-Focal High T4b N1b IVB Structural Incomplete Yes

1.4.3 Category Aggregration and Encoding ¶

  1. Category aggregation was applied to the previously identified categorical predictors observed with many levels (high-cardinality) containing only a few observations to improve model stability during cross-validation and enhance generalization:
    • Thyroid_Function:
      • 176 Thyroid_Function=Euthyroid: 86.27%
      • 28 Thyroid_Function=Hypothyroidism or Hyperthyroidism: 13.73%
    • Physical_Examination:
      • 121 Physical_Examination=Normal or Single Nodular Goiter : 59.31%
      • 83 Physical_Examination=Multinodular or Diffuse Goiter: 40.69%
    • Adenopathy:
      • 143 Adenopathy=No: 70.09%
      • 61 Adenopathy=Yes: 29.90%
    • Pathology:
      • 25 Pathology=Non-Papillary : 12.25%
      • 179 Pathology=Papillary: 87.74%
    • Risk:
      • 131 Risk=Low: 64.22%
      • 73 Risk=Intermediate to High: 35.78%
    • T:
      • 126 T=T1 to T2: 61.76%
      • 78 T=T3 to T4b: 38.23%
    • N:
      • 140 N=N0: 68.63%
      • 65 N=N1: 31.37%
    • Stage:
      • 178 Stage=I: 87.25%
      • 26 Stage=II to IVB: 12.75%
    • Response:
      • 107 Response=Excellent: 52.45%
      • 97 Response=Indeterminate or Incomplete: 47.55%
In [88]:
##################################
# Merging small categories into broader groups 
# for certain categorical predictors
# to ensure sufficient representation in statistical models 
# and prevent sparsity issues in cross-validation
##################################
thyroid_cancer_train_column_filtered['Thyroid_Function'] = thyroid_cancer_train_column_filtered['Thyroid_Function'].map(lambda x: 'Euthyroid' if (x in ['Euthyroid'])  else 'Hypothyroidism or Hyperthyroidism').astype('category')
thyroid_cancer_train_column_filtered['Physical_Examination'] = thyroid_cancer_train_column_filtered['Physical_Examination'].map(lambda x: 'Normal or Single Nodular Goiter' if (x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right'])  else 'Multinodular or Diffuse Goiter').astype('category')
thyroid_cancer_train_column_filtered['Adenopathy'] = thyroid_cancer_train_column_filtered['Adenopathy'].map(lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
thyroid_cancer_train_column_filtered['Pathology'] = thyroid_cancer_train_column_filtered['Pathology'].map(lambda x: 'Non-Papillary' if (x in ['Hurthle Cell', 'Follicular'])  else 'Papillary').astype('category')
thyroid_cancer_train_column_filtered['Risk'] = thyroid_cancer_train_column_filtered['Risk'].map(lambda x: 'Low' if (x in ['Low'])  else 'Intermediate to High').astype('category')
thyroid_cancer_train_column_filtered['T'] = thyroid_cancer_train_column_filtered['T'].map(lambda x: 'T1 to T2' if (x in ['T1a', 'T1b', 'T2'])  else 'T3 to T4b').astype('category')
thyroid_cancer_train_column_filtered['N'] = thyroid_cancer_train_column_filtered['N'].map(lambda x: 'N0' if (x in ['N0'])  else 'N1').astype('category')
thyroid_cancer_train_column_filtered['Stage'] = thyroid_cancer_train_column_filtered['Stage'].map(lambda x: 'I' if (x in ['I'])  else 'II to IVB').astype('category')
thyroid_cancer_train_column_filtered['Response'] = thyroid_cancer_train_column_filtered['Response'].map(lambda x: 'Indeterminate or Incomplete' if (x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete'])  else 'Excellent').astype('category')
thyroid_cancer_train_column_filtered.head()
Out[88]:
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response Recurred
335 29 M No Euthyroid Multinodular or Diffuse Goiter Yes Papillary Multi-Focal Intermediate to High T3 to T4b N1 I Indeterminate or Incomplete Yes
201 25 F No Euthyroid Normal or Single Nodular Goiter Yes Papillary Multi-Focal Low T1 to T2 N1 I Excellent No
134 51 F No Euthyroid Multinodular or Diffuse Goiter No Papillary Uni-Focal Low T1 to T2 N0 I Excellent No
35 37 F No Hypothyroidism or Hyperthyroidism Normal or Single Nodular Goiter No Papillary Uni-Focal Low T1 to T2 N0 I Excellent No
380 72 M Yes Euthyroid Multinodular or Diffuse Goiter Yes Papillary Multi-Focal Intermediate to High T3 to T4b N1 II to IVB Indeterminate or Incomplete Yes
In [89]:
##################################
# Performing a general exploration of the categorical variable levels
# based on the ordered categories
##################################
ordered_cat_cols = thyroid_cancer_train_column_filtered.select_dtypes(include=["category"]).columns
for col in ordered_cat_cols:
    print(f"Column: {col}")
    print("Absolute Frequencies:")
    print(thyroid_cancer_train_column_filtered[col].value_counts().reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
    print("\nNormalized Frequencies:")
    print(thyroid_cancer_train_column_filtered[col].value_counts(normalize=True).reindex(thyroid_cancer_train_column_filtered[col].cat.categories))
    print("-" * 50)
    
Column: Gender
Absolute Frequencies:
M     46
F    158
Name: count, dtype: int64

Normalized Frequencies:
M    0.22549
F    0.77451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Smoking
Absolute Frequencies:
No     172
Yes     32
Name: count, dtype: int64

Normalized Frequencies:
No     0.843137
Yes    0.156863
Name: proportion, dtype: float64
--------------------------------------------------
Column: Thyroid_Function
Absolute Frequencies:
Euthyroid                            176
Hypothyroidism or Hyperthyroidism     28
Name: count, dtype: int64

Normalized Frequencies:
Euthyroid                            0.862745
Hypothyroidism or Hyperthyroidism    0.137255
Name: proportion, dtype: float64
--------------------------------------------------
Column: Physical_Examination
Absolute Frequencies:
Multinodular or Diffuse Goiter      83
Normal or Single Nodular Goiter    121
Name: count, dtype: int64

Normalized Frequencies:
Multinodular or Diffuse Goiter     0.406863
Normal or Single Nodular Goiter    0.593137
Name: proportion, dtype: float64
--------------------------------------------------
Column: Adenopathy
Absolute Frequencies:
No     143
Yes     61
Name: count, dtype: int64

Normalized Frequencies:
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
--------------------------------------------------
Column: Pathology
Absolute Frequencies:
Non-Papillary     25
Papillary        179
Name: count, dtype: int64

Normalized Frequencies:
Non-Papillary    0.122549
Papillary        0.877451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Focality
Absolute Frequencies:
Uni-Focal      118
Multi-Focal     86
Name: count, dtype: int64

Normalized Frequencies:
Uni-Focal      0.578431
Multi-Focal    0.421569
Name: proportion, dtype: float64
--------------------------------------------------
Column: Risk
Absolute Frequencies:
Intermediate to High     73
Low                     131
Name: count, dtype: int64

Normalized Frequencies:
Intermediate to High    0.357843
Low                     0.642157
Name: proportion, dtype: float64
--------------------------------------------------
Column: T
Absolute Frequencies:
T1 to T2     126
T3 to T4b     78
Name: count, dtype: int64

Normalized Frequencies:
T1 to T2     0.617647
T3 to T4b    0.382353
Name: proportion, dtype: float64
--------------------------------------------------
Column: N
Absolute Frequencies:
N0    140
N1     64
Name: count, dtype: int64

Normalized Frequencies:
N0    0.686275
N1    0.313725
Name: proportion, dtype: float64
--------------------------------------------------
Column: Stage
Absolute Frequencies:
I            178
II to IVB     26
Name: count, dtype: int64

Normalized Frequencies:
I            0.872549
II to IVB    0.127451
Name: proportion, dtype: float64
--------------------------------------------------
Column: Response
Absolute Frequencies:
Excellent                      107
Indeterminate or Incomplete     97
Name: count, dtype: int64

Normalized Frequencies:
Excellent                      0.52451
Indeterminate or Incomplete    0.47549
Name: proportion, dtype: float64
--------------------------------------------------
Column: Recurred
Absolute Frequencies:
No     143
Yes     61
Name: count, dtype: int64

Normalized Frequencies:
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
--------------------------------------------------
In [90]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
In [91]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_predictors_categorical
proportion_x_variable = 'Recurred'
In [92]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
No description has been provided for this image

1.4.4 Outlier and Distributional Shape Analysis ¶

  1. No outliers (Outlier.Count>0, Outlier.Ratio>0.000), high skewness (Skewness>3 or Skewness<(-3)) or abnormal kurtosis (Skewness>2 or Skewness<(-2)) observed for the numeric predictor.
    • Age: Outlier.Count = 0, Outlier.Ratio = 0.000, Skewness = 0.592, Kurtosis = -0.461
In [93]:
##################################
# Formulating the imputed dataset
# with numeric columns only
##################################
thyroid_cancer_train_column_filtered['Age'] = pd.to_numeric(thyroid_cancer_train_column_filtered['Age'])
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered.select_dtypes(include='number')
thyroid_cancer_train_column_filtered_numeric = thyroid_cancer_train_column_filtered_numeric.to_frame() if isinstance(thyroid_cancer_train_column_filtered_numeric, pd.Series) else thyroid_cancer_train_column_filtered_numeric
In [94]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(thyroid_cancer_train_column_filtered_numeric.columns)
In [95]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_train_column_filtered_numeric.skew()
In [96]:
##################################
# Computing the interquartile range
# for all columns
##################################
thyroid_cancer_train_column_filtered_numeric_q1 = thyroid_cancer_train_column_filtered_numeric.quantile(0.25)
thyroid_cancer_train_column_filtered_numeric_q3 = thyroid_cancer_train_column_filtered_numeric.quantile(0.75)
thyroid_cancer_train_column_filtered_numeric_iqr = thyroid_cancer_train_column_filtered_numeric_q3 - thyroid_cancer_train_column_filtered_numeric_q1
In [97]:
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((thyroid_cancer_train_column_filtered_numeric < (thyroid_cancer_train_column_filtered_numeric_q1 - 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr)) | (thyroid_cancer_train_column_filtered_numeric > (thyroid_cancer_train_column_filtered_numeric_q3 + 1.5 * thyroid_cancer_train_column_filtered_numeric_iqr))).sum() 
In [98]:
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(thyroid_cancer_train_column_filtered_numeric)] * len(thyroid_cancer_train_column_filtered_numeric.columns))
In [99]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
In [100]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = thyroid_cancer_train_column_filtered_numeric.skew()
In [101]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = thyroid_cancer_train_column_filtered_numeric.kurtosis()
In [102]:
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                  numeric_skewness_list,
                                                  numeric_outlier_count_list,
                                                  numeric_row_count_list,
                                                  numeric_outlier_ratio_list,
                                                  numeric_skewness_list,
                                                  numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Skewness',
                                                 'Outlier.Count',
                                                 'Row.Count',
                                                 'Outlier.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_outlier_summary)
Numeric.Column.Name Skewness Outlier.Count Row.Count Outlier.Ratio Skewness Kurtosis
0 Age 0.592572 0 204 0.0 0.592572 -0.461287
In [103]:
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in thyroid_cancer_train_column_filtered_numeric:
        plt.figure(figsize=(17,1))
        sns.boxplot(data=thyroid_cancer_train_column_filtered_numeric, x=column)
    
No description has been provided for this image

1.4.5 Collinearity ¶

  1. Majority of the predictors reported low (<0.50) to moderate (0.50 to 0.75) correlation.
  2. Among pairwise combinations of categorical predictors, high Phi.Coefficient values were noted for:
    • N and Adenopathy: Phi.Coefficient = +0.827
    • N and Risk: Phi.Coefficient = +0.751
    • Adenopathy and Risk: Phi.Coefficient = +0.696
In [104]:
##################################
# Creating a dataset copy and
# converting all values to numeric
# for correlation analysis
##################################
pd.set_option('future.no_silent_downcasting', True)
thyroid_cancer_train_correlation = thyroid_cancer_train_column_filtered.copy()
thyroid_cancer_train_correlation_object = thyroid_cancer_train_correlation.iloc[:,1:13].columns
custom_category_orders = {
    'Gender': ['M', 'F'],  
    'Smoking': ['No', 'Yes'],  
    'Thyroid_Function': ['Euthyroid', 'Hypothyroidism or Hyperthyroidism'],  
    'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],  
    'Adenopathy': ['No', 'Yes'],  
    'Pathology': ['Non-Papillary', 'Papillary'],  
    'Focality': ['Uni-Focal', 'Multi-Focal'],  
    'Risk': ['Low', 'Intermediate to High'],  
    'T': ['T1 to T2', 'T3 to T4b'],  
    'N': ['N0', 'N1'],  
    'Stage': ['I', 'II to IVB'],  
    'Response': ['Excellent', 'Indeterminate or Incomplete'] 
}
encoder = OrdinalEncoder(categories=[custom_category_orders[col] for col in thyroid_cancer_train_correlation_object])
thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object] = encoder.fit_transform(
    thyroid_cancer_train_correlation[thyroid_cancer_train_correlation_object]
)
thyroid_cancer_train_correlation = thyroid_cancer_train_correlation.drop(['Recurred'], axis=1)
display(thyroid_cancer_train_correlation)
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response
335 29 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0
201 25 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0
134 51 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
35 37 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
380 72 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 31 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
231 21 1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0
297 61 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0
270 39 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
302 67 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0

204 rows × 13 columns

In [105]:
##################################
# Initializing the correlation matrix
##################################
thyroid_cancer_train_correlation_matrix = pd.DataFrame(np.zeros((len(thyroid_cancer_train_correlation.columns), len(thyroid_cancer_train_correlation.columns))),
                                                       columns=thyroid_cancer_train_correlation.columns,
                                                       index=thyroid_cancer_train_correlation.columns)
In [106]:
##################################
# Creating an empty correlation matrix
##################################
thyroid_cancer_train_correlation_matrix = pd.DataFrame(
    np.zeros((len(thyroid_cancer_train_correlation.columns), len(thyroid_cancer_train_correlation.columns))),
    index=thyroid_cancer_train_correlation.columns,
    columns=thyroid_cancer_train_correlation.columns
)


##################################
# Calculating different types
# of correlation coefficients
# per variable type
##################################
for i in range(len(thyroid_cancer_train_correlation.columns)):
    for j in range(i, len(thyroid_cancer_train_correlation.columns)):
        if i == j:
            thyroid_cancer_train_correlation_matrix.iloc[i, j] = 1.0  
        else:
            col_i = thyroid_cancer_train_correlation.iloc[:, i]
            col_j = thyroid_cancer_train_correlation.iloc[:, j]

            # Detecting binary variables (assumes binary variables are coded as 0/1)
            is_binary_i = col_i.nunique() == 2
            is_binary_j = col_j.nunique() == 2

            # Computing the Pearson correlation for two continuous variables
            if col_i.dtype in ['int64', 'float64'] and col_j.dtype in ['int64', 'float64']:
                corr = col_i.corr(col_j)

            # Computing the Point-Biserial correlation for continuous and binary variables
            elif (col_i.dtype in ['int64', 'float64'] and is_binary_j) or (col_j.dtype in ['int64', 'float64'] and is_binary_i):
                continuous_var = col_i if col_i.dtype in ['int64', 'float64'] else col_j
                binary_var = col_j if is_binary_j else col_i

                # Convert binary variable to 0/1 (if not already)
                binary_var = binary_var.astype('category').cat.codes
                corr, _ = pointbiserialr(continuous_var, binary_var)

            # Computing the Phi coefficient for two binary variables
            elif is_binary_i and is_binary_j:
                corr = col_i.corr(col_j) 

            # Computing the Cramér's V for two categorical variables (if more than 2 categories)
            else:
                contingency_table = pd.crosstab(col_i, col_j)
                chi2, _, _, _ = chi2_contingency(contingency_table)
                n = contingency_table.sum().sum()
                phi2 = chi2 / n
                r, k = contingency_table.shape
                corr = np.sqrt(phi2 / min(k - 1, r - 1))  # Cramér's V formula

            # Assigning correlation values to the matrix
            thyroid_cancer_train_correlation_matrix.iloc[i, j] = corr
            thyroid_cancer_train_correlation_matrix.iloc[j, i] = corr

# Displaying the correlation matrix
display(thyroid_cancer_train_correlation_matrix)
            
Age Gender Smoking Thyroid_Function Physical_Examination Adenopathy Pathology Focality Risk T N Stage Response
Age 1.000000 -0.194235 0.384793 -0.064362 0.166593 0.075158 -0.103521 0.193520 0.218433 0.230384 0.032385 0.548657 0.277894
Gender -0.194235 1.000000 -0.605869 -0.023393 -0.126177 -0.262486 0.048746 -0.204469 -0.331298 -0.178901 -0.292449 -0.286223 -0.261361
Smoking 0.384793 -0.605869 1.000000 0.023809 0.136654 0.307114 -0.208749 0.232285 0.352861 0.243106 0.260305 0.522287 0.345057
Thyroid_Function -0.064362 -0.023393 0.023809 1.000000 0.075621 -0.073821 -0.068143 -0.052038 -0.030299 -0.079318 -0.054779 -0.024290 -0.151571
Physical_Examination 0.166593 -0.126177 0.136654 0.075621 1.000000 0.156521 0.035651 0.464966 0.276835 0.190238 0.192704 0.102383 0.170525
Adenopathy 0.075158 -0.262486 0.307114 -0.073821 0.156521 1.000000 0.015525 0.309718 0.696240 0.521653 0.827536 0.360418 0.514449
Pathology -0.103521 0.048746 -0.208749 -0.068143 0.035651 0.015525 1.000000 -0.104765 -0.064050 -0.228898 0.091596 -0.081303 -0.212909
Focality 0.193520 -0.204469 0.232285 -0.052038 0.464966 0.309718 -0.104765 1.000000 0.439542 0.451800 0.364112 0.269075 0.379817
Risk 0.218433 -0.331298 0.352861 -0.030299 0.276835 0.696240 -0.064050 0.439542 1.000000 0.654179 0.751465 0.511978 0.620217
T 0.230384 -0.178901 0.243106 -0.079318 0.190238 0.521653 -0.228898 0.451800 0.654179 1.000000 0.489771 0.425255 0.583975
N 0.032385 -0.292449 0.260305 -0.054779 0.192704 0.827536 0.091596 0.364112 0.751465 0.489771 1.000000 0.375186 0.519732
Stage 0.548657 -0.286223 0.522287 -0.024290 0.102383 0.360418 -0.081303 0.269075 0.511978 0.425255 0.375186 1.000000 0.371970
Response 0.277894 -0.261361 0.345057 -0.151571 0.170525 0.514449 -0.212909 0.379817 0.620217 0.583975 0.519732 0.371970 1.000000
In [107]:
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric and categorical columns
##################################
plt.figure(figsize=(17, 8))
sns.heatmap(thyroid_cancer_train_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
No description has been provided for this image

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis ¶

  1. Bivariate analysis identified individual predictors with generally positive association to the target variable based on visual inspection.
  2. Higher values or higher proportions for the following predictors are associated with the Recurred=Yescategory:
    • Age
    • Gender=M
    • Smoking=Yes
    • Physical_Examination=Multinodular or Diffuse Goiter
    • Adenopathy=Yes
    • Focality=Multi-Focal
    • Risk=Intermediate to High
    • T=T3 to T4b
    • N=N1
    • Stage=II to IVB
    • Response=Indeterminate or Incomplete
  3. Proportions for the following predictors are not associated with the Recurred=Yes or Recurred=No categories:
    • Thyroid_Function
    • Pathology
In [108]:
##################################
# Segregating the target
# and predictor variables
##################################
thyroid_cancer_train_column_filtered_predictors = thyroid_cancer_train_column_filtered.iloc[:,:-1].columns
thyroid_cancer_train_column_filtered_predictors_numeric = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:, thyroid_cancer_train_column_filtered.iloc[:,:-1].columns == 'Age'].columns
thyroid_cancer_train_column_filtered_predictors_categorical = thyroid_cancer_train_column_filtered.iloc[:,:-1].loc[:,thyroid_cancer_train_column_filtered.iloc[:,:-1].columns != 'Age'].columns
In [109]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = thyroid_cancer_train_column_filtered_predictors_numeric
In [110]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
boxplot_y_variable = 'Recurred'
boxplot_x_variable = numeric_variable_name_list.values[0]
In [111]:
##################################
# Evaluating the numeric predictors
# against the target variable
##################################
plt.figure(figsize=(7, 5))
plt.boxplot([group[boxplot_x_variable] for name, group in thyroid_cancer_train_column_filtered.groupby(boxplot_y_variable, observed=True)])
plt.title(f'{boxplot_y_variable} Versus {boxplot_x_variable}')
plt.xlabel(boxplot_y_variable)
plt.ylabel(boxplot_x_variable)
plt.xticks(range(1, len(thyroid_cancer_train_column_filtered[boxplot_y_variable].unique()) + 1), ['No', 'Yes'])
plt.show()
No description has been provided for this image
In [112]:
##################################
# Segregating the target variable
# and categorical predictors
##################################
proportion_y_variables = thyroid_cancer_train_column_filtered_predictors_categorical
proportion_x_variable = 'Recurred'
In [113]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 4
num_cols = 3

##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 20))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual stacked column plots
# for all categorical columns
##################################
for i, y_variable in enumerate(proportion_y_variables):
    ax = axes[i]
    category_counts = thyroid_cancer_train_column_filtered.groupby([proportion_x_variable, y_variable], observed=True).size().unstack(fill_value=0)
    category_proportions = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_proportions.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{proportion_x_variable} Versus {y_variable}')
    ax.set_xlabel(proportion_x_variable)
    ax.set_ylabel('Proportions')
    ax.legend(loc="lower center")

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
No description has been provided for this image

1.5.2 Hypothesis Testing ¶

  1. The relationship between the numeric predictor to the Recurred target variable was statistically evaluated using the following hypotheses:
    • Null: Difference in the means between groups Yes and No is equal to zero
    • Alternative: Difference in the means between groups Yes and No is not equal to zero
  2. There is sufficient evidence to conclude of a statistically significant difference between the means of the numeric measurements obtained from Yes and No groups of the Recurred target variable in 1 of 1 numeric predictor given its high t-test statistic values with reported low p-values less than the significance level of 0.05.
    • Age: T.Test.Statistic=-3.791, T.Test.PValue=0.000
  3. The relationship between the categorical predictors to the Recurred target variable was statistically evaluated using the following hypotheses:
    • Null: The categorical predictor is independent of the categorical target variable
    • Alternative: The categorical predictor is dependent of the categorical target variable
  4. There is sufficient evidence to conclude of a statistically significant relationship between the categories of the categorical predictors and the Yes and No groups of the Recurred target variable in 10 of 12 categorical predictors given their high chisquare statistic values with reported low p-values less than the significance level of 0.05.
    • Risk: ChiSquare.Test.Statistic=115.387, ChiSquare.Test.PValue=0.000
    • Response: ChiSquare.Test.Statistic=93.015, ChiSquare.Test.PValue=0.000
    • N: ChiSquare.Test.Statistic=87.380, ChiSquare.Test.PValue=0.001
    • Adenopathy: ChiSquare.Test.Statistic=82.909, ChiSquare.Test.PValue=0.002
    • Stage: ChiSquare.Test.Statistic=58.828, ChiSquare.Test.PValue=0.000
    • T: ChiSquare.Test.Statistic=57.882, ChiSquare.Test.PValue=0.000
    • Smoking: ChiSquare.Test.Statistic=34.318, ChiSquare.Test.PValue=0.001
    • Gender: ChiSquare.Test.Statistic=29.114, ChiSquare.Test.PValue=0.002
    • Focality: ChiSquare.Test.Statistic=27.017, ChiSquare.Test.PValue=0.001
    • Physical_Examination: ChiSquare.Test.Statistic=5.717, ChiSquare.Test.PValue=0.016
In [114]:
##################################
# Computing the t-test 
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_ttest_target = {}
thyroid_cancer_numeric = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns == 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_numeric_columns = thyroid_cancer_train_column_filtered_predictors_numeric
for numeric_column in thyroid_cancer_numeric_columns:
    group_0 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='No']
    group_1 = thyroid_cancer_numeric[thyroid_cancer_numeric.loc[:,'Recurred']=='Yes']
    thyroid_cancer_numeric_ttest_target['Recurred_' + numeric_column] = stats.ttest_ind(
        group_0[numeric_column], 
        group_1[numeric_column], 
        equal_var=True)
In [115]:
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
thyroid_cancer_numeric_summary = thyroid_cancer_numeric.from_dict(thyroid_cancer_numeric_ttest_target, orient='index')
thyroid_cancer_numeric_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(thyroid_cancer_numeric_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_numeric)))
T.Test.Statistic T.Test.PValue
Recurred_Age -3.791048 0.000198
In [116]:
##################################
# Computing the chisquare
# statistic and p-values
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_chisquare_target = {}
thyroid_cancer_categorical = thyroid_cancer_train_column_filtered.loc[:,(thyroid_cancer_train_column_filtered.columns != 'Age') | (thyroid_cancer_train_column_filtered.columns == 'Recurred')]
thyroid_cancer_categorical_columns = thyroid_cancer_train_column_filtered_predictors_categorical
for categorical_column in thyroid_cancer_categorical_columns:
    contingency_table = pd.crosstab(thyroid_cancer_categorical[categorical_column], 
                                    thyroid_cancer_categorical['Recurred'])
    thyroid_cancer_categorical_chisquare_target['Recurred_' + categorical_column] = stats.chi2_contingency(
        contingency_table)[0:2]
In [117]:
##################################
# Formulating the pairwise chisquare summary
# between the target variable
# and categorical predictor columns
##################################
thyroid_cancer_categorical_summary = thyroid_cancer_categorical.from_dict(thyroid_cancer_categorical_chisquare_target, orient='index')
thyroid_cancer_categorical_summary.columns = ['ChiSquare.Test.Statistic', 'ChiSquare.Test.PValue']
display(thyroid_cancer_categorical_summary.sort_values(by=['ChiSquare.Test.PValue'], ascending=True).head(len(thyroid_cancer_train_column_filtered_predictors_categorical)))
ChiSquare.Test.Statistic ChiSquare.Test.PValue
Recurred_Risk 115.387077 6.474266e-27
Recurred_Response 93.015440 5.188795e-22
Recurred_N 87.380222 8.954142e-21
Recurred_Adenopathy 82.909484 8.589806e-20
Recurred_Stage 58.828665 1.720169e-14
Recurred_T 57.882234 2.782892e-14
Recurred_Smoking 34.318952 4.678040e-09
Recurred_Gender 29.114212 6.823460e-08
Recurred_Focality 27.017885 2.015816e-07
Recurred_Physical_Examination 5.717930 1.679252e-02
Recurred_Thyroid_Function 2.961746 8.525584e-02
Recurred_Pathology 0.891397 3.450989e-01

1.6. Premodelling Data Preparation ¶

1.6.1 Preprocessed Data Description¶

  1. A total of 6 of the 16 predictors were excluded from the dataset based on the data preprocessing and exploration findings
  2. There were 3 categorical predictors excluded from the dataset after having been observed with extremely low variance containing categories with very few or almost no variations across observations that may have limited predictive power or drive increased model complexity without performance gains:
    • Hx_Smoking:
      • 189 Hx_Smoking=No: 92.65%
      • 15 Hx_Smoking=Yes: 7.35%
    • Hx_Radiotherapy:
      • 199 Hx_Radiotherapy=No: 97.55%
      • 15 Hx_Radiotherapy=Yes: 2.45%
    • M:
      • 192 M=M0: 94.12%
      • 12 M=M1: 5.88%
  3. There was 1 categorical predictor excluded from the dataset after having been observed with high pairwise collinearity (Phi.Coefficient>0.70) with other 2 predictors that might provide redundant information, leading to potential instability in regression models.
    • N was highly associated with Adenopathy: Phi.Coefficient = +0.827
    • N was highly associated with Risk: Phi.Coefficient = +0.751
  4. Another 2 categorical predictors were excluded from the dataset for not exhibiting a statistically significant association with the Yes and No groups of the Recurred target variable, indicating weak predictive value.
    • Thyroid_Function: ChiSquare.Test.Statistic=2.962, ChiSquare.Test.PValue=0.085
    • Pathology: ChiSquare.Test.Statistic=0.891, ChiSquare.Test.PValue=0.345
  5. The preprocessed train data (final) subset is comprised of:
    • 204 rows (observations)
      • 143 Recurred=No: 70.10%
      • 61 Recurred=Yes: 29.90%
    • 11 columns (variables)
      • 1/11 target (categorical)
        • Recurred
      • 1/11 predictor (numeric)
        • Age
      • 9/11 predictor (categorical)
        • Gender
        • Smoking
        • Physical_Examination
        • Adenopathy
        • Focality
        • Risk
        • T
        • M
        • Stage
        • Response

1.6.2 Preprocessing Pipeline Development¶

  1. A preprocessing pipeline was formulated and applied to the train data (final), validation data and test data with the following actions:
    • Excluded specified columns noted with low variance, high collinearity and weak predictive power
    • Aggregated categories in multiclass categorical variables into binary levels
    • Converted categorical columns to the appropriate type
    • Set the order of category levels for ordinal encoding during modeling pipeline creation
In [118]:
##################################
# Formulating a preprocessing pipeline
# that removes the specified columns,
# aggregates categories in multiclass categorical variables,
# converts categorical columns to the appropriate type, and
# sets the order of category levels
##################################
def preprocess_dataset(df):
    # Removing the specified columns
    columns_to_remove = ['Hx_Smoking', 'Hx_Radiotherapy', 'M', 'N', 'Thyroid_Function', 'Pathology']
    df = df.drop(columns=columns_to_remove)
    
    # Applying category aggregation
    df['Physical_Examination'] = df['Physical_Examination'].map(
        lambda x: 'Normal or Single Nodular Goiter' if x in ['Normal', 'Single nodular goiter-left', 'Single nodular goiter-right'] 
        else 'Multinodular or Diffuse Goiter').astype('category')
    
    df['Adenopathy'] = df['Adenopathy'].map(
        lambda x: 'No' if x == 'No' else ('Yes' if pd.notna(x) and x != '' else x)).astype('category')
    
    df['Risk'] = df['Risk'].map(
        lambda x: 'Low' if x == 'Low' else 'Intermediate to High').astype('category')
    
    df['T'] = df['T'].map(
        lambda x: 'T1 to T2' if x in ['T1a', 'T1b', 'T2'] else 'T3 to T4b').astype('category')
    
    df['Stage'] = df['Stage'].map(
        lambda x: 'I' if x == 'I' else 'II to IVB').astype('category')
    
    df['Response'] = df['Response'].map(
        lambda x: 'Indeterminate or Incomplete' if x in ['Indeterminate', 'Structural Incomplete', 'Biochemical Incomplete'] 
        else 'Excellent').astype('category')
    
    # Setting category levels
    category_mappings = {
        'Gender': ['M', 'F'],
        'Smoking': ['No', 'Yes'],
        'Physical_Examination': ['Normal or Single Nodular Goiter', 'Multinodular or Diffuse Goiter'],
        'Adenopathy': ['No', 'Yes'],
        'Focality': ['Uni-Focal', 'Multi-Focal'],
        'Risk': ['Low', 'Intermediate to High'],
        'T': ['T1 to T2', 'T3 to T4b'],
        'Stage': ['I', 'II to IVB'],
        'Response': ['Excellent', 'Indeterminate or Incomplete']
    }
    
    for col, categories in category_mappings.items():
        df[col] = df[col].astype('category')
        df[col] = df[col].cat.set_categories(categories, ordered=True)
    
    return df
    
In [119]:
##################################
# Applying the preprocessing pipeline
# to the train data
##################################
thyroid_cancer_preprocessed_train = preprocess_dataset(thyroid_cancer_train)
X_preprocessed_train = thyroid_cancer_preprocessed_train.drop('Recurred', axis = 1)
y_preprocessed_train = thyroid_cancer_preprocessed_train['Recurred']
thyroid_cancer_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_PATH, "thyroid_cancer_preprocessed_train.csv"), index=False)
X_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH, "X_preprocessed_train.csv"), index=False)
y_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_TARGET_PATH, "y_preprocessed_train.csv"), index=False)
print('Final Preprocessed Train Dataset Dimensions: ')
display(X_preprocessed_train.shape)
display(y_preprocessed_train.shape)
print('Final Preprocessed Train Target Variable Breakdown: ')
display(y_preprocessed_train.value_counts())
print('Final Preprocessed Train Target Variable Proportion: ')
display(y_preprocessed_train.value_counts(normalize = True))
thyroid_cancer_preprocessed_train.head()
Final Preprocessed Train Dataset Dimensions: 
(204, 10)
(204,)
Final Preprocessed Train Target Variable Breakdown: 
Recurred
No     143
Yes     61
Name: count, dtype: int64
Final Preprocessed Train Target Variable Proportion: 
Recurred
No     0.70098
Yes    0.29902
Name: proportion, dtype: float64
Out[119]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
335 29 M No Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b I Indeterminate or Incomplete Yes
201 25 F No Normal or Single Nodular Goiter Yes Multi-Focal Low T1 to T2 I Excellent No
134 51 F No Multinodular or Diffuse Goiter No Uni-Focal Low T1 to T2 I Excellent No
35 37 F No Normal or Single Nodular Goiter No Uni-Focal Low T1 to T2 I Excellent No
380 72 M Yes Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete Yes
In [120]:
##################################
# Applying the preprocessing pipeline
# to the validation data
##################################
thyroid_cancer_preprocessed_validation = preprocess_dataset(thyroid_cancer_validation)
X_preprocessed_validation = thyroid_cancer_preprocessed_validation.drop('Recurred', axis = 1)
y_preprocessed_validation = thyroid_cancer_preprocessed_validation['Recurred']
thyroid_cancer_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_PATH, "thyroid_cancer_preprocessed_validation.csv"), index=False)
X_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH, "X_preprocessed_validation.csv"), index=False)
y_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH, "y_preprocessed_validation.csv"), index=False)
print('Final Preprocessed Validation Dataset Dimensions: ')
display(X_preprocessed_validation.shape)
display(y_preprocessed_validation.shape)
print('Final Preprocessed Validation Target Variable Breakdown: ')
display(y_preprocessed_validation.value_counts())
print('Final Preprocessed Validation Target Variable Proportion: ')
display(y_preprocessed_validation.value_counts(normalize = True))
thyroid_cancer_preprocessed_validation.head()
Final Preprocessed Validation Dataset Dimensions: 
(69, 10)
(69,)
Final Preprocessed Validation Target Variable Breakdown: 
Recurred
No     49
Yes    20
Name: count, dtype: int64
Final Preprocessed Validation Target Variable Proportion: 
Recurred
No     0.710145
Yes    0.289855
Name: proportion, dtype: float64
Out[120]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
49 29 F No Multinodular or Diffuse Goiter No Uni-Focal Low T1 to T2 I Excellent No
353 73 F No Normal or Single Nodular Goiter Yes Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete Yes
204 36 F No Normal or Single Nodular Goiter Yes Uni-Focal Low T1 to T2 I Excellent No
283 30 F No Normal or Single Nodular Goiter No Multi-Focal Intermediate to High T3 to T4b I Excellent No
254 31 M Yes Normal or Single Nodular Goiter No Uni-Focal Low T3 to T4b I Indeterminate or Incomplete No
In [121]:
##################################
# Applying the preprocessing pipeline
# to the test data
##################################
thyroid_cancer_preprocessed_test = preprocess_dataset(thyroid_cancer_test)
X_preprocessed_test = thyroid_cancer_preprocessed_test.drop('Recurred', axis = 1)
y_preprocessed_test = thyroid_cancer_preprocessed_test['Recurred']
thyroid_cancer_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_PATH, "thyroid_cancer_preprocessed_test.csv"), index=False)
X_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_FEATURES_PATH, "X_preprocessed_test.csv"), index=False)
y_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_TARGET_PATH, "y_preprocessed_test.csv"), index=False)
print('Final Preprocessed Test Dataset Dimensions: ')
display(X_preprocessed_test.shape)
display(y_preprocessed_test.shape)
print('Final Preprocessed Test Target Variable Breakdown: ')
display(y_preprocessed_test.value_counts())
print('Final Preprocessed Test Target Variable Proportion: ')
display(y_preprocessed_test.value_counts(normalize = True))
thyroid_cancer_preprocessed_test.head()
Final Preprocessed Test Dataset Dimensions: 
(91, 10)
(91,)
Final Preprocessed Test Target Variable Breakdown: 
Recurred
No     64
Yes    27
Name: count, dtype: int64
Final Preprocessed Test Target Variable Proportion: 
Recurred
No     0.703297
Yes    0.296703
Name: proportion, dtype: float64
Out[121]:
Age Gender Smoking Physical_Examination Adenopathy Focality Risk T Stage Response Recurred
379 81 M Yes Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete Yes
125 31 F No Normal or Single Nodular Goiter Yes Uni-Focal Low T1 to T2 I Excellent No
286 58 F No Multinodular or Diffuse Goiter No Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete No
244 35 F No Multinodular or Diffuse Goiter No Uni-Focal Low T3 to T4b I Excellent No
369 71 M Yes Multinodular or Diffuse Goiter Yes Multi-Focal Intermediate to High T3 to T4b II to IVB Indeterminate or Incomplete Yes

1.7. Bagged Model Development ¶

1.7.1 Random Forest ¶

In [122]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [123]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_rf_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_rf_model', RandomForestClassifier(class_weight='balanced', 
                                               random_state=88888888))
])
In [124]:
##################################
# Defining hyperparameter grid
##################################
bagged_rf_hyperparameter_grid = {
    'bagged_rf_model__criterion': ['gini', 'entropy'],
    'bagged_rf_model__max_depth': [3, 5],
    'bagged_rf_model__min_samples_leaf': [5, 10],
    'bagged_rf_model__n_estimators': [100, 200]
}
In [125]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [126]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_rf_grid_search = GridSearchCV(
    estimator=bagged_rf_pipeline,
    param_grid=bagged_rf_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [127]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [128]:
##################################
# Fitting GridSearchCV
##################################
bagged_rf_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[128]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_rf_model',
                                        RandomForestClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_rf_model__criterion': ['gini', 'entropy'],
                         'bagged_rf_model__max_depth': [3, 5],
                         'bagged_rf_model__min_samples_leaf': [5, 10],
                         'bagged_rf_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_rf_model',
                                        RandomForestClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_rf_model__criterion': ['gini', 'entropy'],
                         'bagged_rf_model__max_depth': [3, 5],
                         'bagged_rf_model__min_samples_leaf': [5, 10],
                         'bagged_rf_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('bagged_rf_model',
                 RandomForestClassifier(class_weight='balanced',
                                        criterion='entropy', max_depth=5,
                                        min_samples_leaf=5,
                                        random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=5, min_samples_leaf=5, random_state=88888888)
In [129]:
##################################
# Identifying the best model
##################################
bagged_rf_optimal = bagged_rf_grid_search.best_estimator_
In [130]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_rf_optimal_f1_cv = bagged_rf_grid_search.best_score_
bagged_rf_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
bagged_rf_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
In [131]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model - Random Forest: ')
print(f"Best Random Forest Hyperparameters: {bagged_rf_grid_search.best_params_}")
Best Bagged Model - Random Forest: 
Best Random Forest Hyperparameters: {'bagged_rf_model__criterion': 'entropy', 'bagged_rf_model__max_depth': 5, 'bagged_rf_model__min_samples_leaf': 5, 'bagged_rf_model__n_estimators': 100}
In [132]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_rf_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_rf_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8709
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [133]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_rf_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [134]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_rf_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [135]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_rf_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [136]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_rf_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_random_forest_optimal.pkl"))
Out[136]:
['..\\models\\bagged_model_random_forest_optimal.pkl']

1.7.2 Extra Trees ¶

In [137]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [138]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_et_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_et_model', ExtraTreesClassifier(class_weight='balanced', 
                                               random_state=88888888))
])
In [139]:
##################################
# Defining hyperparameter grid
##################################
bagged_et_hyperparameter_grid = {
    'bagged_et_model__criterion': ['gini', 'entropy'],
    'bagged_et_model__max_depth': [3, 6],
    'bagged_et_model__min_samples_leaf': [5, 10],
    'bagged_et_model__n_estimators': [100, 200]
}
In [140]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [141]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_et_grid_search = GridSearchCV(
    estimator=bagged_et_pipeline,
    param_grid=bagged_et_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [142]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [143]:
##################################
# Fitting GridSearchCV
##################################
bagged_et_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[143]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_et_model',
                                        ExtraTreesClassifier(class_weight='balanced',
                                                             random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_et_model__criterion': ['gini', 'entropy'],
                         'bagged_et_model__max_depth': [3, 6],
                         'bagged_et_model__min_samples_leaf': [5, 10],
                         'bagged_et_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('bagged_et_model',
                                        ExtraTreesClassifier(class_weight='balanced',
                                                             random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_et_model__criterion': ['gini', 'entropy'],
                         'bagged_et_model__max_depth': [3, 6],
                         'bagged_et_model__min_samples_leaf': [5, 10],
                         'bagged_et_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('bagged_et_model',
                 ExtraTreesClassifier(class_weight='balanced', max_depth=3,
                                      min_samples_leaf=5,
                                      random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
ExtraTreesClassifier(class_weight='balanced', max_depth=3, min_samples_leaf=5,
                     random_state=88888888)
In [144]:
##################################
# Identifying the best model
##################################
bagged_et_optimal = bagged_et_grid_search.best_estimator_
In [145]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_et_optimal_f1_cv = bagged_et_grid_search.best_score_
bagged_et_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
bagged_et_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
In [146]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Extra Trees: ')
print(f"Best Extra Trees Hyperparameters: {bagged_et_grid_search.best_params_}")
Best Bagged Model – Extra Trees: 
Best Extra Trees Hyperparameters: {'bagged_et_model__criterion': 'gini', 'bagged_et_model__max_depth': 3, 'bagged_et_model__min_samples_leaf': 5, 'bagged_et_model__n_estimators': 100}
In [147]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_et_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_et_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8760
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [148]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_et_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [149]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_et_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [150]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_et_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Extra Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [151]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_et_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_extra_trees_optimal.pkl"))
Out[151]:
['..\\models\\bagged_model_extra_trees_optimal.pkl']

1.7.3 Bagged Decision Tree ¶

In [152]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [153]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bdt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_bdt_model', BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced', 
                                                                            random_state=88888888),
                                           random_state=88888888))
])
In [154]:
##################################
# Defining hyperparameter grid
##################################
bagged_bdt_hyperparameter_grid = {
    'bagged_bdt_model__estimator__criterion': ['gini', 'entropy'],
    'bagged_bdt_model__estimator__max_depth': [3, 6],
    'bagged_bdt_model__estimator__min_samples_leaf': [5, 10],
    'bagged_bdt_model__n_estimators': [100, 200]
}
In [155]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [156]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bdt_grid_search = GridSearchCV(
    estimator=bagged_bdt_pipeline,
    param_grid=bagged_bdt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [157]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [158]:
##################################
# Fitting GridSearchCV
##################################
bagged_bdt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[158]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                                           random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_bdt_model__estimator__criterion': ['gini',
                                                                    'entropy'],
                         'bagged_bdt_model__estimator__max_depth': [3, 6],
                         'bagged_bdt_model__estimator__min_samples_leaf': [5,
                                                                           10],
                         'bagged_bdt_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                                           random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_bdt_model__estimator__criterion': ['gini',
                                                                    'entropy'],
                         'bagged_bdt_model__estimator__max_depth': [3, 6],
                         'bagged_bdt_model__estimator__min_samples_leaf': [5,
                                                                           10],
                         'bagged_bdt_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('bagged_bdt_model',
                 BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                    criterion='entropy',
                                                                    max_depth=3,
                                                                    min_samples_leaf=5,
                                                                    random_state=88888888),
                                   n_estimators=100, random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
BaggingClassifier(estimator=DecisionTreeClassifier(class_weight='balanced',
                                                   criterion='entropy',
                                                   max_depth=3,
                                                   min_samples_leaf=5,
                                                   random_state=88888888),
                  n_estimators=100, random_state=88888888)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=5, random_state=88888888)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=5, random_state=88888888)
In [159]:
##################################
# Identifying the best model
##################################
bagged_bdt_optimal = bagged_bdt_grid_search.best_estimator_
In [160]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bdt_optimal_f1_cv = bagged_bdt_grid_search.best_score_
bagged_bdt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
bagged_bdt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
In [161]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Decision Trees: ')
print(f"Best Bagged Decision Trees Hyperparameters: {bagged_bdt_grid_search.best_params_}")
Best Bagged Model – Bagged Decision Trees: 
Best Bagged Decision Trees Hyperparameters: {'bagged_bdt_model__estimator__criterion': 'entropy', 'bagged_bdt_model__estimator__max_depth': 3, 'bagged_bdt_model__estimator__min_samples_leaf': 5, 'bagged_bdt_model__n_estimators': 100}
In [162]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bdt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bdt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8792
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [163]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bdt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [164]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bdt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [165]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bdt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [166]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_bdt_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_decision_trees_optimal.pkl"))
Out[166]:
['..\\models\\bagged_model_bagged_decision_trees_optimal.pkl']

1.7.4 Bagged Logistic Regression ¶

In [167]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [168]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_blr_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_blr_model', BaggingClassifier(estimator=LogisticRegression(class_weight='balanced', 
                                                                        random_state=88888888),
                                           random_state=88888888))
])
In [169]:
##################################
# Defining hyperparameter grid
##################################
bagged_blr_hyperparameter_grid = {
    'bagged_blr_model__estimator__C': [0.1, 1.0],
    'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
    'bagged_blr_model__estimator__solver': ['liblinear', 'saga'],
    'bagged_blr_model__n_estimators': [100, 200]
}
In [170]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [171]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_blr_grid_search = GridSearchCV(
    estimator=bagged_blr_pipeline,
    param_grid=bagged_blr_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [172]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [173]:
##################################
# Fitting GridSearchCV
##################################
bagged_blr_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[173]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                                                       random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_blr_model__estimator__C': [0.1, 1.0],
                         'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
                         'bagged_blr_model__estimator__solver': ['liblinear',
                                                                 'saga'],
                         'bagged_blr_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                                                       random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_blr_model__estimator__C': [0.1, 1.0],
                         'bagged_blr_model__estimator__penalty': ['l1', 'l2'],
                         'bagged_blr_model__estimator__solver': ['liblinear',
                                                                 'saga'],
                         'bagged_blr_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('bagged_blr_model',
                 BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                                random_state=88888888,
                                                                solver='liblinear'),
                                   n_estimators=200, random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
BaggingClassifier(estimator=LogisticRegression(class_weight='balanced',
                                               random_state=88888888,
                                               solver='liblinear'),
                  n_estimators=200, random_state=88888888)
LogisticRegression(class_weight='balanced', random_state=88888888,
                   solver='liblinear')
LogisticRegression(class_weight='balanced', random_state=88888888,
                   solver='liblinear')
In [174]:
##################################
# Identifying the best model
##################################
bagged_blr_optimal = bagged_blr_grid_search.best_estimator_
In [175]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_blr_optimal_f1_cv = bagged_blr_grid_search.best_score_
bagged_blr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
bagged_blr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
In [176]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Logistic Regression: ')
print(f"Best Bagged Logistic Regression Hyperparameters: {bagged_blr_grid_search.best_params_}")
Best Bagged Model – Bagged Logistic Regression: 
Best Bagged Logistic Regression Hyperparameters: {'bagged_blr_model__estimator__C': 1.0, 'bagged_blr_model__estimator__penalty': 'l2', 'bagged_blr_model__estimator__solver': 'liblinear', 'bagged_blr_model__n_estimators': 200}
In [177]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_blr_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_blr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8763
F1 Score on Training Data: 0.8906

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.97      0.93      0.95       143
         1.0       0.85      0.93      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [178]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_blr_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [179]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_blr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [180]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_blr_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [181]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_blr_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_logistic_regression_optimal.pkl"))
Out[181]:
['..\\models\\bagged_model_bagged_logistic_regression_optimal.pkl']

1.7.5 Bagged Support Vector Machine ¶

In [182]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [183]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_bsvm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('bagged_bsvm_model', BaggingClassifier(estimator=SVC(class_weight='balanced', 
                                                          random_state=88888888),
                                            random_state=88888888))
])
In [184]:
##################################
# Defining hyperparameter grid
##################################
bagged_bsvm_hyperparameter_grid = {
    'bagged_bsvm_model__estimator__C': [0.1, 1.0],
    'bagged_bsvm_model__estimator__kernel': ['linear', 'rbf'],
    'bagged_bsvm_model__estimator__gamma': ['scale','auto'],
    'bagged_bsvm_model__n_estimators': [100, 200]
}
In [185]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [186]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_bsvm_grid_search = GridSearchCV(
    estimator=bagged_bsvm_pipeline,
    param_grid=bagged_bsvm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [187]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [188]:
##################################
# Fitting GridSearchCV
##################################
bagged_bsvm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[188]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=SVC(class_weight='balanced',
                                                                        random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_bsvm_model__estimator__C': [0.1, 1.0],
                         'bagged_bsvm_model__estimator__gamma': ['scale',
                                                                 'auto'],
                         'bagged_bsvm_model__estimator__kernel': ['linear',
                                                                  'rbf'],
                         'bagged_bsvm_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                        BaggingClassifier(estimator=SVC(class_weight='balanced',
                                                                        random_state=88888888),
                                                          random_state=88888888))]),
             n_jobs=-1,
             param_grid={'bagged_bsvm_model__estimator__C': [0.1, 1.0],
                         'bagged_bsvm_model__estimator__gamma': ['scale',
                                                                 'auto'],
                         'bagged_bsvm_model__estimator__kernel': ['linear',
                                                                  'rbf'],
                         'bagged_bsvm_model__n_estimators': [100, 200]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('bagged_bsvm_model',
                 BaggingClassifier(estimator=SVC(class_weight='balanced',
                                                 kernel='linear',
                                                 random_state=88888888),
                                   n_estimators=100, random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
BaggingClassifier(estimator=SVC(class_weight='balanced', kernel='linear',
                                random_state=88888888),
                  n_estimators=100, random_state=88888888)
SVC(class_weight='balanced', kernel='linear', random_state=88888888)
SVC(class_weight='balanced', kernel='linear', random_state=88888888)
In [189]:
##################################
# Identifying the best model
##################################
bagged_bsvm_optimal = bagged_bsvm_grid_search.best_estimator_
In [190]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_bsvm_optimal_f1_cv = bagged_bsvm_grid_search.best_score_
bagged_bsvm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
bagged_bsvm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
In [191]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model – Bagged Support Vector Machine: ')
print(f"Best Bagged Support Vector Machine Hyperparameters: {bagged_bsvm_grid_search.best_params_}")
Best Bagged Model – Bagged Support Vector Machine: 
Best Bagged Support Vector Machine Hyperparameters: {'bagged_bsvm_model__estimator__C': 1.0, 'bagged_bsvm_model__estimator__gamma': 'scale', 'bagged_bsvm_model__estimator__kernel': 'linear', 'bagged_bsvm_model__n_estimators': 100}
In [192]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_bsvm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_bsvm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8829
F1 Score on Training Data: 0.9062

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.98      0.94      0.96       143
         1.0       0.87      0.95      0.91        61

    accuracy                           0.94       204
   macro avg       0.92      0.94      0.93       204
weighted avg       0.94      0.94      0.94       204

In [193]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, bagged_bsvm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [194]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_bsvm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [195]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, bagged_bsvm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Bagged Support Vector Machine Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [196]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_bsvm_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_bagged_svm_optimal.pkl"))
Out[196]:
['..\\models\\bagged_model_bagged_svm_optimal.pkl']

1.8. Boosted Model Development ¶

1.8.1 AdaBoost ¶

In [197]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [198]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_ab_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=88888888),
                                            random_state=88888888))
])
In [199]:
##################################
# Defining hyperparameter grid
##################################
boosted_ab_hyperparameter_grid = {
    'boosted_ab_model__learning_rate': [0.01, 0.10],  
    'boosted_ab_model__estimator__max_depth': [1, 2],
    'boosted_ab_model__n_estimators': [50, 100] 
}
In [200]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [201]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_ab_grid_search = GridSearchCV(
    estimator=boosted_ab_pipeline,
    param_grid=boosted_ab_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [202]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [203]:
##################################
# Fitting GridSearchCV
##################################
boosted_ab_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[203]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_ab_model',
                                        AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=88888888),
                                                           random_state=88888888))]),
             n_jobs=-1,
             param_grid={'boosted_ab_model__estimator__max_depth': [1, 2],
                         'boosted_ab_model__learning_rate': [0.01, 0.1],
                         'boosted_ab_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_ab_model',
                                        AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=88888888),
                                                           random_state=88888888))]),
             n_jobs=-1,
             param_grid={'boosted_ab_model__estimator__max_depth': [1, 2],
                         'boosted_ab_model__learning_rate': [0.01, 0.1],
                         'boosted_ab_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('boosted_ab_model',
                 AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2,
                                                                     random_state=88888888),
                                    learning_rate=0.01,
                                    random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2,
                                                    random_state=88888888),
                   learning_rate=0.01, random_state=88888888)
DecisionTreeClassifier(max_depth=2, random_state=88888888)
DecisionTreeClassifier(max_depth=2, random_state=88888888)
In [204]:
##################################
# Identifying the best model
##################################
boosted_ab_optimal = boosted_ab_grid_search.best_estimator_
In [205]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_ab_optimal_f1_cv = boosted_ab_grid_search.best_score_
boosted_ab_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
boosted_ab_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
In [206]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - AdaBoost: ')
print(f"Best AdaBoost Hyperparameters: {boosted_ab_grid_search.best_params_}")
Best Boosted Model - AdaBoost: 
Best AdaBoost Hyperparameters: {'boosted_ab_model__estimator__max_depth': 2, 'boosted_ab_model__learning_rate': 0.01, 'boosted_ab_model__n_estimators': 50}
In [207]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_ab_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_ab_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8900
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [208]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_ab_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [209]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_ab_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [210]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_ab_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [211]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_ab_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_adaboost_optimal.pkl"))
Out[211]:
['..\\models\\boosted_model_adaboost_optimal.pkl']

1.8.2 Gradient Boosting ¶

In [212]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [213]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_gb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_gb_model', GradientBoostingClassifier(random_state=88888888))
])
In [214]:
##################################
# Defining hyperparameter grid
##################################
boosted_gb_hyperparameter_grid = {
    'boosted_gb_model__learning_rate': [0.01, 0.10],
    'boosted_gb_model__max_depth': [3, 6], 
    'boosted_gb_model__min_samples_leaf': [5, 10],
    'boosted_gb_model__n_estimators': [50, 100] 
}
In [215]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [216]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_gb_grid_search = GridSearchCV(
    estimator=boosted_gb_pipeline,
    param_grid=boosted_gb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [217]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [218]:
##################################
# Fitting GridSearchCV
##################################
boosted_gb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[218]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_gb_model',
                                        GradientBoostingClassifier(random_state=88888888))]),
             n_jobs=-1,
             param_grid={'boosted_gb_model__learning_rate': [0.01, 0.1],
                         'boosted_gb_model__max_depth': [3, 6],
                         'boosted_gb_model__min_samples_leaf': [5, 10],
                         'boosted_gb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_gb_model',
                                        GradientBoostingClassifier(random_state=88888888))]),
             n_jobs=-1,
             param_grid={'boosted_gb_model__learning_rate': [0.01, 0.1],
                         'boosted_gb_model__max_depth': [3, 6],
                         'boosted_gb_model__min_samples_leaf': [5, 10],
                         'boosted_gb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('boosted_gb_model',
                 GradientBoostingClassifier(learning_rate=0.01,
                                            min_samples_leaf=5,
                                            random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
GradientBoostingClassifier(learning_rate=0.01, min_samples_leaf=5,
                           random_state=88888888)
In [219]:
##################################
# Identifying the best model
##################################
boosted_gb_optimal = boosted_gb_grid_search.best_estimator_
In [220]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_gb_optimal_f1_cv = boosted_gb_grid_search.best_score_
boosted_gb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
boosted_gb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
In [221]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Gradient Boosting: ')
print(f"Best Gradient Boosting Hyperparameters: {boosted_gb_grid_search.best_params_}")
Best Boosted Model - Gradient Boosting: 
Best Gradient Boosting Hyperparameters: {'boosted_gb_model__learning_rate': 0.01, 'boosted_gb_model__max_depth': 3, 'boosted_gb_model__min_samples_leaf': 5, 'boosted_gb_model__n_estimators': 100}
In [222]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_gb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_gb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8803
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [223]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_gb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [224]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_gb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [225]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_gb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [226]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_gb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_gradient_boosting_optimal.pkl"))
Out[226]:
['..\\models\\boosted_model_gradient_boosting_optimal.pkl']

1.8.3 XGBoost ¶

In [227]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [228]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_xgb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_xgb_model', XGBClassifier(scale_pos_weight=2.0, 
                                        random_state=88888888,
                                        subsample=0.7,
                                        colsample_bytree=0.7,
                                        eval_metric='logloss'))
])
In [229]:
##################################
# Defining hyperparameter grid
##################################
boosted_xgb_hyperparameter_grid = {
    'boosted_xgb_model__learning_rate': [0.01, 0.10],
    'boosted_xgb_model__max_depth': [3, 6], 
    'boosted_xgb_model__gamma': [0.1, 0.2],
    'boosted_xgb_model__n_estimators': [50, 100] 
}
In [230]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [231]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_xgb_grid_search = GridSearchCV(
    estimator=boosted_xgb_pipeline,
    param_grid=boosted_xgb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [232]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [233]:
##################################
# Fitting GridSearchCV
##################################
boosted_xgb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[233]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                                      missing=nan,
                                                      monotone_constraints=None,
                                                      multi_strategy=None,
                                                      n_estimators=None,
                                                      n_jobs=None,
                                                      num_parallel_tree=None,
                                                      random_state=88888888, ...))]),
             n_jobs=-1,
             param_grid={'boosted_xgb_model__gamma': [0.1, 0.2],
                         'boosted_xgb_model__learning_rate': [0.01, 0.1],
                         'boosted_xgb_model__max_depth': [3, 6],
                         'boosted_xgb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])]...
                                                      missing=nan,
                                                      monotone_constraints=None,
                                                      multi_strategy=None,
                                                      n_estimators=None,
                                                      n_jobs=None,
                                                      num_parallel_tree=None,
                                                      random_state=88888888, ...))]),
             n_jobs=-1,
             param_grid={'boosted_xgb_model__gamma': [0.1, 0.2],
                         'boosted_xgb_model__learning_rate': [0.01, 0.1],
                         'boosted_xgb_model__max_depth': [3, 6],
                         'boosted_xgb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('boosted_xgb_model',
                 XGBClassifier(base_score=None, booster=None, callbacks=None,
                               colsample_byle...
                               feature_types=None, gamma=0.1, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=0.01,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=3, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, multi_strategy=None,
                               n_estimators=50, n_jobs=None,
                               num_parallel_tree=None, random_state=88888888, ...))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.7, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=0.1, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=3,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=50,
              n_jobs=None, num_parallel_tree=None, random_state=88888888, ...)
In [234]:
##################################
# Identifying the best model
##################################
boosted_xgb_optimal = boosted_xgb_grid_search.best_estimator_
In [235]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_xgb_optimal_f1_cv = boosted_xgb_grid_search.best_score_
boosted_xgb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
boosted_xgb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
In [236]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - XGBoost: ')
print(f"Best XGBoost Hyperparameters: {boosted_xgb_grid_search.best_params_}")
Best Boosted Model - XGBoost: 
Best XGBoost Hyperparameters: {'boosted_xgb_model__gamma': 0.1, 'boosted_xgb_model__learning_rate': 0.01, 'boosted_xgb_model__max_depth': 3, 'boosted_xgb_model__n_estimators': 50}
In [237]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_xgb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_xgb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8900
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [238]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_xgb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [239]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_xgb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [240]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_xgb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [241]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_xgb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_xgboost_optimal.pkl"))
Out[241]:
['..\\models\\boosted_model_xgboost_optimal.pkl']

1.8.4 Light GBM ¶

In [242]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [243]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_lgbm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_lgbm_model', LGBMClassifier(scale_pos_weight=2.0, 
                                          random_state=88888888,
                                          max_depth=-1,
                                          feature_fraction =0.7,
                                          bagging_fraction=0.7,
                                          verbose=-1))
])
In [244]:
##################################
# Defining hyperparameter grid
##################################
boosted_lgbm_hyperparameter_grid = {
    'boosted_lgbm_model__learning_rate': [0.01, 0.10],
    'boosted_lgbm_model__min_child_samples': [3, 5], 
    'boosted_lgbm_model__num_leaves': [8, 16],
    'boosted_lgbm_model__n_estimators': [50, 100] 
}
In [245]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [246]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_lgbm_grid_search = GridSearchCV(
    estimator=boosted_lgbm_pipeline,
    param_grid=boosted_lgbm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [247]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [248]:
##################################
# Fitting GridSearchCV
##################################
boosted_lgbm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[248]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_lgbm_model',
                                        LGBMClassifier(bagging_fraction=0.7,
                                                       feature_fraction=0.7,
                                                       random_state=88888888,
                                                       scale_pos_weight=2.0,
                                                       verbose=-1))]),
             n_jobs=-1,
             param_grid={'boosted_lgbm_model__learning_rate': [0.01, 0.1],
                         'boosted_lgbm_model__min_child_samples': [3, 5],
                         'boosted_lgbm_model__n_estimators': [50, 100],
                         'boosted_lgbm_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_lgbm_model',
                                        LGBMClassifier(bagging_fraction=0.7,
                                                       feature_fraction=0.7,
                                                       random_state=88888888,
                                                       scale_pos_weight=2.0,
                                                       verbose=-1))]),
             n_jobs=-1,
             param_grid={'boosted_lgbm_model__learning_rate': [0.01, 0.1],
                         'boosted_lgbm_model__min_child_samples': [3, 5],
                         'boosted_lgbm_model__n_estimators': [50, 100],
                         'boosted_lgbm_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('boosted_lgbm_model',
                 LGBMClassifier(bagging_fraction=0.7, feature_fraction=0.7,
                                learning_rate=0.01, min_child_samples=5,
                                num_leaves=8, random_state=88888888,
                                scale_pos_weight=2.0, verbose=-1))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
LGBMClassifier(bagging_fraction=0.7, feature_fraction=0.7, learning_rate=0.01,
               min_child_samples=5, num_leaves=8, random_state=88888888,
               scale_pos_weight=2.0, verbose=-1)
In [249]:
##################################
# Identifying the best model
##################################
boosted_lgbm_optimal = boosted_lgbm_grid_search.best_estimator_
In [250]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.utils.validation')
boosted_lgbm_optimal_f1_cv = boosted_lgbm_grid_search.best_score_
boosted_lgbm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
boosted_lgbm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
In [251]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Light GBM: ')
print(f"Best Light GBM Hyperparameters: {boosted_lgbm_grid_search.best_params_}")
Best Boosted Model - Light GBM: 
Best Light GBM Hyperparameters: {'boosted_lgbm_model__learning_rate': 0.01, 'boosted_lgbm_model__min_child_samples': 5, 'boosted_lgbm_model__n_estimators': 100, 'boosted_lgbm_model__num_leaves': 8}
In [252]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_lgbm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_lgbm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8831
F1 Score on Training Data: 0.9344

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       143
         1.0       0.93      0.93      0.93        61

    accuracy                           0.96       204
   macro avg       0.95      0.95      0.95       204
weighted avg       0.96      0.96      0.96       204

In [253]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_lgbm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [254]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_lgbm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8000

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.92      0.92      0.92        49
         1.0       0.80      0.80      0.80        20

    accuracy                           0.88        69
   macro avg       0.86      0.86      0.86        69
weighted avg       0.88      0.88      0.88        69

In [255]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_lgbm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [256]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_lgbm_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_light_gbm_optimal.pkl"))
Out[256]:
['..\\models\\boosted_model_light_gbm_optimal.pkl']

1.8.5 CatBoost ¶

In [257]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [258]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_cb_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('boosted_cb_model', LGBMClassifier(scale_pos_weight=2.0, 
                                        random_state=88888888,
                                        subsample =0.7,
                                        colsample_bylevel=0.7))
])
In [259]:
##################################
# Defining hyperparameter grid
##################################
boosted_cb_hyperparameter_grid = {
    'boosted_cb_model__learning_rate': [0.01, 0.10],
    'boosted_cb_model__max_depth': [3, 6], 
    'boosted_cb_model__num_leaves': [8, 16],
    'boosted_cb_model__iterations': [50, 100] 
}
In [260]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [261]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_cb_grid_search = GridSearchCV(
    estimator=boosted_cb_pipeline,
    param_grid=boosted_cb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [262]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [263]:
##################################
# Fitting GridSearchCV
##################################
boosted_cb_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[263]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_cb_model',
                                        LGBMClassifier(colsample_bylevel=0.7,
                                                       random_state=88888888,
                                                       scale_pos_weight=2.0,
                                                       subsample=0.7))]),
             n_jobs=-1,
             param_grid={'boosted_cb_model__iterations': [50, 100],
                         'boosted_cb_model__learning_rate': [0.01, 0.1],
                         'boosted_cb_model__max_depth': [3, 6],
                         'boosted_cb_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('boosted_cb_model',
                                        LGBMClassifier(colsample_bylevel=0.7,
                                                       random_state=88888888,
                                                       scale_pos_weight=2.0,
                                                       subsample=0.7))]),
             n_jobs=-1,
             param_grid={'boosted_cb_model__iterations': [50, 100],
                         'boosted_cb_model__learning_rate': [0.01, 0.1],
                         'boosted_cb_model__max_depth': [3, 6],
                         'boosted_cb_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('boosted_cb_model',
                 LGBMClassifier(colsample_bylevel=0.7, iterations=50,
                                learning_rate=0.01, max_depth=6, num_leaves=8,
                                random_state=88888888, scale_pos_weight=2.0,
                                subsample=0.7))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
LGBMClassifier(colsample_bylevel=0.7, iterations=50, learning_rate=0.01,
               max_depth=6, num_leaves=8, random_state=88888888,
               scale_pos_weight=2.0, subsample=0.7)
In [264]:
##################################
# Identifying the best model
##################################
boosted_cb_optimal = boosted_cb_grid_search.best_estimator_
In [265]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_cb_optimal_f1_cv = boosted_cb_grid_search.best_score_
boosted_cb_optimal_f1_train = f1_score(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
boosted_cb_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
In [266]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - CatBoost: ')
print(f"Best CatBoost Hyperparameters: {boosted_cb_grid_search.best_params_}")
Best Boosted Model - CatBoost: 
Best CatBoost Hyperparameters: {'boosted_cb_model__iterations': 50, 'boosted_cb_model__learning_rate': 0.01, 'boosted_cb_model__max_depth': 6, 'boosted_cb_model__num_leaves': 8}
In [267]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_cb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_cb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8722
F1 Score on Training Data: 0.8889

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.94      0.95       143
         1.0       0.86      0.92      0.89        61

    accuracy                           0.93       204
   macro avg       0.91      0.93      0.92       204
weighted avg       0.93      0.93      0.93       204

In [268]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, boosted_cb_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [269]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_cb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [270]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, boosted_cb_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [271]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_cb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_catboost_optimal.pkl"))
Out[271]:
['..\\models\\boosted_model_catboost_optimal.pkl']

1.9. Stacked Model Development ¶

1.9.1 Base Learner - K-Nearest Neighbors ¶

In [272]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough',
    force_int_remainder_cols=False)
In [273]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_knn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_knn_model', KNeighborsClassifier())
])
In [274]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_knn_hyperparameter_grid = {
    'stacked_baselearner_knn_model__n_neighbors': [3, 5],
    'stacked_baselearner_knn_model__weights': ['uniform', 'distance'],
    'stacked_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
In [275]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [276]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_knn_grid_search = GridSearchCV(
    estimator=stacked_baselearner_knn_pipeline,
    param_grid=stacked_baselearner_knn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [277]:
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [278]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[278]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'stacked_baselearner_knn_model__n_neighbors': [3, 5],
                         'stacked_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'stacked_baselearner_knn_model__n_neighbors': [3, 5],
                         'stacked_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('stacked_baselearner_knn_model',
                 KNeighborsClassifier(n_neighbors=3, weights='distance'))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
KNeighborsClassifier(n_neighbors=3, weights='distance')
In [279]:
##################################
# Identifying the best model
##################################
stacked_baselearner_knn_optimal = stacked_baselearner_knn_grid_search.best_estimator_
In [280]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_knn_optimal_f1_cv = stacked_baselearner_knn_grid_search.best_score_
stacked_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
stacked_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
In [281]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner KNN: ')
print(f"Best Stacked Base Learner KNN Hyperparameters: {stacked_baselearner_knn_grid_search.best_params_}")
Best Stacked Base Learner KNN: 
Best Stacked Base Learner KNN Hyperparameters: {'stacked_baselearner_knn_model__metric': 'minkowski', 'stacked_baselearner_knn_model__n_neighbors': 3, 'stacked_baselearner_knn_model__weights': 'distance'}
In [282]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6792
F1 Score on Training Data: 0.9917

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      1.00      1.00       143
         1.0       1.00      0.98      0.99        61

    accuracy                           1.00       204
   macro avg       1.00      0.99      0.99       204
weighted avg       1.00      1.00      1.00       204

In [283]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [284]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.7368

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.88      0.92      0.90        49
         1.0       0.78      0.70      0.74        20

    accuracy                           0.86        69
   macro avg       0.83      0.81      0.82        69
weighted avg       0.85      0.86      0.85        69

In [285]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [286]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_knn_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_knn_optimal.pkl"))
Out[286]:
['..\\models\\stacked_model_baselearner_knn_optimal.pkl']

1.9.2 Base Learner - Support Vector Machine ¶

In [287]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [288]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_svm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_svm_model', SVC(class_weight='balanced',
                                          random_state=88888888))
])
In [289]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_svm_hyperparameter_grid = {
    'stacked_baselearner_svm_model__C': [0.1, 1.0],
    'stacked_baselearner_svm_model__kernel': ['linear', 'rbf'],
    'stacked_baselearner_svm_model__gamma': ['scale','auto']
}
In [290]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [291]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_svm_grid_search = GridSearchCV(
    estimator=stacked_baselearner_svm_pipeline,
    param_grid=stacked_baselearner_svm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [292]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [293]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[293]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_svm_model__C': [0.1, 1.0],
                         'stacked_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'stacked_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_svm_model__C': [0.1, 1.0],
                         'stacked_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'stacked_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('stacked_baselearner_svm_model',
                 SVC(class_weight='balanced', kernel='linear',
                     random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
SVC(class_weight='balanced', kernel='linear', random_state=88888888)
In [294]:
##################################
# Identifying the best model
##################################
stacked_baselearner_svm_optimal = stacked_baselearner_svm_grid_search.best_estimator_
In [295]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_svm_optimal_f1_cv = stacked_baselearner_svm_grid_search.best_score_
stacked_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
stacked_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
In [296]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner SVM: ')
print(f"Best Stacked Base Learner SVM Hyperparameters: {stacked_baselearner_svm_grid_search.best_params_}")
Best Stacked Base Learner SVM: 
Best Stacked Base Learner SVM Hyperparameters: {'stacked_baselearner_svm_model__C': 1.0, 'stacked_baselearner_svm_model__gamma': 'scale', 'stacked_baselearner_svm_model__kernel': 'linear'}
In [297]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8827
F1 Score on Training Data: 0.9008

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      0.92      0.95       143
         1.0       0.84      0.97      0.90        61

    accuracy                           0.94       204
   macro avg       0.91      0.95      0.93       204
weighted avg       0.94      0.94      0.94       204

In [298]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [299]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [300]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [301]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_svm_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_svm_optimal.pkl"))
Out[301]:
['..\\models\\stacked_model_baselearner_svm_optimal.pkl']

1.9.3 Base Learner - Ridge Classifier ¶

In [302]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [303]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_rc_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
                                                     random_state=88888888))
])
In [304]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_rc_hyperparameter_grid = {
    'stacked_baselearner_rc_model__alpha': [1.00, 2.00],
    'stacked_baselearner_rc_model__solver': ['sag', 'saga'],
    'stacked_baselearner_rc_model__tol': [1e-3, 1e-4]
}
In [305]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [306]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_rc_grid_search = GridSearchCV(
    estimator=stacked_baselearner_rc_pipeline,
    param_grid=stacked_baselearner_rc_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [307]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [308]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[308]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_rc_model__alpha': [1.0, 2.0],
                         'stacked_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'stacked_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_rc_model__alpha': [1.0, 2.0],
                         'stacked_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'stacked_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('stacked_baselearner_rc_model',
                 RidgeClassifier(alpha=2.0, class_weight='balanced',
                                 random_state=88888888, solver='sag',
                                 tol=0.001))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
RidgeClassifier(alpha=2.0, class_weight='balanced', random_state=88888888,
                solver='sag', tol=0.001)
In [309]:
##################################
# Identifying the best model
##################################
stacked_baselearner_rc_optimal = stacked_baselearner_rc_grid_search.best_estimator_
In [310]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_rc_optimal_f1_cv = stacked_baselearner_rc_grid_search.best_score_
stacked_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
stacked_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
In [311]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Ridge Classifier: ')
print(f"Best Stacked Base Learner Ridge Classifier Hyperparameters: {stacked_baselearner_rc_grid_search.best_params_}")
Best Stacked Base Learner Ridge Classifier: 
Best Stacked Base Learner Ridge Classifier Hyperparameters: {'stacked_baselearner_rc_model__alpha': 2.0, 'stacked_baselearner_rc_model__solver': 'sag', 'stacked_baselearner_rc_model__tol': 0.001}
In [312]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8655
F1 Score on Training Data: 0.8819

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.93      0.95       143
         1.0       0.85      0.92      0.88        61

    accuracy                           0.93       204
   macro avg       0.91      0.92      0.91       204
weighted avg       0.93      0.93      0.93       204

In [313]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [314]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [315]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [316]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_rc_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_ridge_classifier_optimal.pkl"))
Out[316]:
['..\\models\\stacked_model_baselearner_ridge_classifier_optimal.pkl']

1.9.4 Base Learner - Neural Network ¶

In [317]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [318]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_nn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_nn_model', MLPClassifier(max_iter=500,
                                                   solver='lbfgs',
                                                   early_stopping=False,
                                                   random_state=88888888))
])
In [319]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_nn_hyperparameter_grid = {
    'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
    'stacked_baselearner_nn_model__activation': ['relu', 'tanh'],
    'stacked_baselearner_nn_model__alpha': [0.0001, 0.001]
}
In [320]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [321]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_nn_grid_search = GridSearchCV(
    estimator=stacked_baselearner_nn_pipeline,
    param_grid=stacked_baselearner_nn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [322]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [323]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[323]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=88888888,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'stacked_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=88888888,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'stacked_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'stacked_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('stacked_baselearner_nn_model',
                 MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                               random_state=88888888, solver='lbfgs'))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=88888888,
              solver='lbfgs')
In [324]:
##################################
# Identifying the best model
##################################
stacked_baselearner_nn_optimal = stacked_baselearner_nn_grid_search.best_estimator_
In [325]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_nn_optimal_f1_cv = stacked_baselearner_nn_grid_search.best_score_
stacked_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
stacked_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
In [326]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Neural Network: ')
print(f"Best Stacked Base Learner Neural Network Hyperparameters: {stacked_baselearner_nn_grid_search.best_params_}")
Best Stacked Base Learner Neural Network: 
Best Stacked Base Learner Neural Network Hyperparameters: {'stacked_baselearner_nn_model__activation': 'relu', 'stacked_baselearner_nn_model__alpha': 0.0001, 'stacked_baselearner_nn_model__hidden_layer_sizes': (50,)}
In [327]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8674
F1 Score on Training Data: 0.9180

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       143
         1.0       0.92      0.92      0.92        61

    accuracy                           0.95       204
   macro avg       0.94      0.94      0.94       204
weighted avg       0.95      0.95      0.95       204

In [328]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [329]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.7692

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.90      0.92      0.91        49
         1.0       0.79      0.75      0.77        20

    accuracy                           0.87        69
   macro avg       0.84      0.83      0.84        69
weighted avg       0.87      0.87      0.87        69

In [330]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [331]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_nn_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_neural_network_optimal.pkl"))
Out[331]:
['..\\models\\stacked_model_baselearner_neural_network_optimal.pkl']

1.9.5 Base Learner - Decision Tree ¶

In [332]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [333]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
stacked_baselearner_dt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('stacked_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
                                                            random_state=88888888))
])
In [334]:
##################################
# Defining hyperparameter grid
##################################
stacked_baselearner_dt_hyperparameter_grid = {
    'stacked_baselearner_dt_model__criterion': ['gini', 'entropy'],
    'stacked_baselearner_dt_model__max_depth': [3, 5],
    'stacked_baselearner_dt_model__min_samples_leaf': [5, 10]
}
In [335]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [336]:
##################################
# Performing Grid Search with cross-validation
##################################
stacked_baselearner_dt_grid_search = GridSearchCV(
    estimator=stacked_baselearner_dt_pipeline,
    param_grid=stacked_baselearner_dt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [337]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [338]:
##################################
# Fitting GridSearchCV
##################################
stacked_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[338]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'stacked_baselearner_dt_model__max_depth': [3, 5],
                         'stacked_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('stacked_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'stacked_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'stacked_baselearner_dt_model__max_depth': [3, 5],
                         'stacked_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('stacked_baselearner_dt_model',
                 DecisionTreeClassifier(class_weight='balanced',
                                        criterion='entropy', max_depth=3,
                                        min_samples_leaf=5,
                                        random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=5, random_state=88888888)
In [339]:
##################################
# Identifying the best model
##################################
stacked_baselearner_dt_optimal = stacked_baselearner_dt_grid_search.best_estimator_
In [340]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
stacked_baselearner_dt_optimal_f1_cv = stacked_baselearner_dt_grid_search.best_score_
stacked_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
stacked_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
In [341]:
##################################
# Identifying the optimal model
##################################
print('Best Stacked Base Learner Decision Trees: ')
print(f"Best Stacked Base Learner Decision Trees Hyperparameters: {stacked_baselearner_dt_grid_search.best_params_}")
Best Stacked Base Learner Decision Trees: 
Best Stacked Base Learner Decision Trees Hyperparameters: {'stacked_baselearner_dt_model__criterion': 'entropy', 'stacked_baselearner_dt_model__max_depth': 3, 'stacked_baselearner_dt_model__min_samples_leaf': 5}
In [342]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {stacked_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {stacked_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8767
F1 Score on Training Data: 0.8788

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.98      0.91      0.94       143
         1.0       0.82      0.95      0.88        61

    accuracy                           0.92       204
   macro avg       0.90      0.93      0.91       204
weighted avg       0.93      0.92      0.92       204

In [343]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [344]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {stacked_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [345]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Base Learner Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Base Learner Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [346]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(stacked_baselearner_dt_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_baselearner_decision_trees_optimal.pkl"))
Out[346]:
['..\\models\\stacked_model_baselearner_decision_trees_optimal.pkl']

1.9.6 Meta Learner - Logistic Regression ¶

In [347]:
##################################
# Defining the stacking strategy (5-fold CV)
##################################
stacking_strategy = KFold(n_splits=5,
                          shuffle=True,
                          random_state=88888888)
In [348]:
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
stacked_baselearners = {}
stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in stacked_baselearner_model:
    stacked_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl"))
    stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)
        
In [349]:
##################################
# Initializing the meta-feature matrices
##################################
meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
meta_validation_stacked = np.zeros((X_preprocessed_validation.shape[0], len(stacked_baselearners)))
In [350]:
##################################
# Generating out-of-fold predictions for training the meta learner
##################################
for i, (name, model) in enumerate(stacked_baselearners.items()):
    oof_preds = np.zeros(X_preprocessed_train.shape[0])
    validation_fold_preds = np.zeros((X_preprocessed_validation.shape[0], stacking_strategy.get_n_splits()))

    for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
        model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
        oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
        validation_fold_preds[:, j] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
        
    # Extracting the meta-feature matrix for the train data
    meta_train_stacked[:, i] = oof_preds
    # Extracting the meta-feature matrix for the validation data
    # Averaging the validation predictions across folds
    meta_validation_stacked[:, i] = validation_fold_preds.mean(axis=1)  
In [351]:
##################################
# Training the meta learner on the stacked features
##################################
stacked_metalearner_lr_optimal = LogisticRegression(class_weight='balanced', 
                                            penalty='l2',
                                            C=1.0,
                                            solver='lbfgs',
                                            random_state=88888888)
stacked_metalearner_lr_optimal.fit(meta_train_stacked, y_preprocessed_train_encoded)
Out[351]:
LogisticRegression(class_weight='balanced', random_state=88888888)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(class_weight='balanced', random_state=88888888)
In [352]:
##################################
# Saving the meta learner model
# developed from the meta-train data
################################## 
joblib.dump(stacked_metalearner_lr_optimal, 
            os.path.join("..", MODELS_PATH, "stacked_model_metalearner_logistic_regression_optimal.pkl"))
Out[352]:
['..\\models\\stacked_model_metalearner_logistic_regression_optimal.pkl']
In [353]:
##################################
# Creating a function to extract the 
# meta-feature matrices for new data
################################## 
def extract_stacked_metafeature_matrix(X_preprocessed_new):
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    stacked_baselearners = {}
    stacked_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in stacked_baselearner_model:
        stacked_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"stacked_model_baselearner_{name}_optimal.pkl"))
        stacked_baselearners[name] = joblib.load(stacked_baselearner_model_path)

    ##################################
    # Generating meta-features for new data
    ##################################
    meta_train_stacked = np.zeros((X_preprocessed_train.shape[0], len(stacked_baselearners)))
    meta_new_stacked = np.zeros((X_preprocessed_new.shape[0], len(stacked_baselearners)))

    ##################################
    # Generating out-of-fold predictions for training the meta learner
    ##################################
    for i, (name, model) in enumerate(stacked_baselearners.items()):
        oof_preds = np.zeros(X_preprocessed_train.shape[0])
        new_fold_preds = np.zeros((X_preprocessed_new.shape[0], stacking_strategy.get_n_splits()))

        for j, (train_idx, val_idx) in enumerate(stacking_strategy.split(X_preprocessed_train)):
            model.fit(X_preprocessed_train.iloc[train_idx], y_preprocessed_train_encoded[train_idx])
            oof_preds[val_idx] = model.predict_proba(X_preprocessed_train.iloc[val_idx])[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_train.iloc[val_idx])
            new_fold_preds[:, j] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)
        
        # Extracting the meta-feature matrix for the train data
        meta_train_stacked[:, i] = oof_preds
        # Extracting the meta-feature matrix for the new data
        # Averaging the new predictions across folds
        meta_new_stacked[:, i] = new_fold_preds.mean(axis=1)

    return meta_new_stacked
    
In [354]:
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
stacked_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
stacked_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
In [355]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {stacked_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.9062

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.98      0.94      0.96       143
         1.0       0.87      0.95      0.91        61

    accuracy                           0.94       204
   macro avg       0.92      0.94      0.93       204
weighted avg       0.94      0.94      0.94       204

In [356]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [357]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {stacked_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validationing Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [358]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, stacked_metalearner_lr_optimal.predict(extract_stacked_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Stacked Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image

1.10. Blended Model Development ¶

1.10.1 Base Learner - K-Nearest Neighbors ¶

In [359]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
    remainder='passthrough',
    force_int_remainder_cols=False)
In [360]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_knn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_knn_model', KNeighborsClassifier())
])
In [361]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_knn_hyperparameter_grid = {
    'blended_baselearner_knn_model__n_neighbors': [3, 5],
    'blended_baselearner_knn_model__weights': ['uniform', 'distance'],
    'blended_baselearner_knn_model__metric': ['minkowski', 'euclidean']
}
In [362]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [363]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_knn_grid_search = GridSearchCV(
    estimator=blended_baselearner_knn_pipeline,
    param_grid=blended_baselearner_knn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [364]:
##################################
# Encoding the response variables
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [365]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_knn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[365]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'blended_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'blended_baselearner_knn_model__n_neighbors': [3, 5],
                         'blended_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_knn_model',
                                        KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'blended_baselearner_knn_model__metric': ['minkowski',
                                                                   'euclidean'],
                         'blended_baselearner_knn_model__n_neighbors': [3, 5],
                         'blended_baselearner_knn_model__weights': ['uniform',
                                                                    'distance']},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('blended_baselearner_knn_model',
                 KNeighborsClassifier(n_neighbors=3, weights='distance'))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
KNeighborsClassifier(n_neighbors=3, weights='distance')
In [366]:
##################################
# Identifying the best model
##################################
blended_baselearner_knn_optimal = blended_baselearner_knn_grid_search.best_estimator_
In [367]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_knn_optimal_f1_cv = blended_baselearner_knn_grid_search.best_score_
blended_baselearner_knn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
blended_baselearner_knn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
In [368]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner KNN: ')
print(f"Best Blended Base Learner KNN Hyperparameters: {blended_baselearner_knn_grid_search.best_params_}")
Best Blended Base Learner KNN: 
Best Blended Base Learner KNN Hyperparameters: {'blended_baselearner_knn_model__metric': 'minkowski', 'blended_baselearner_knn_model__n_neighbors': 3, 'blended_baselearner_knn_model__weights': 'distance'}
In [369]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_knn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_knn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.6792
F1 Score on Training Data: 0.9917

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      1.00      1.00       143
         1.0       1.00      0.98      0.99        61

    accuracy                           1.00       204
   macro avg       1.00      0.99      0.99       204
weighted avg       1.00      1.00      1.00       204

In [370]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [371]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_knn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.7368

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.88      0.92      0.90        49
         1.0       0.78      0.70      0.74        20

    accuracy                           0.86        69
   macro avg       0.83      0.81      0.82        69
weighted avg       0.85      0.86      0.85        69

In [372]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_knn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner KNN Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [373]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_knn_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_knn_optimal.pkl"))
Out[373]:
['..\\models\\blended_model_baselearner_knn_optimal.pkl']

1.10.2 Base Learner - Support Vector Machine ¶

In [374]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [375]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_svm_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_svm_model', SVC(class_weight='balanced',
                                          random_state=88888888))
])
In [376]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_svm_hyperparameter_grid = {
    'blended_baselearner_svm_model__C': [0.1, 1.0],
    'blended_baselearner_svm_model__kernel': ['linear', 'rbf'],
    'blended_baselearner_svm_model__gamma': ['scale','auto']
}
In [377]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [378]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_svm_grid_search = GridSearchCV(
    estimator=blended_baselearner_svm_pipeline,
    param_grid=blended_baselearner_svm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [379]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [380]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_svm_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[380]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_svm_model__C': [0.1, 1.0],
                         'blended_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'blended_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_svm_model',
                                        SVC(class_weight='balanced',
                                            random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_svm_model__C': [0.1, 1.0],
                         'blended_baselearner_svm_model__gamma': ['scale',
                                                                  'auto'],
                         'blended_baselearner_svm_model__kernel': ['linear',
                                                                   'rbf']},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('blended_baselearner_svm_model',
                 SVC(class_weight='balanced', kernel='linear',
                     random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
SVC(class_weight='balanced', kernel='linear', random_state=88888888)
In [381]:
##################################
# Identifying the best model
##################################
blended_baselearner_svm_optimal = blended_baselearner_svm_grid_search.best_estimator_
In [382]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_svm_optimal_f1_cv = blended_baselearner_svm_grid_search.best_score_
blended_baselearner_svm_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
blended_baselearner_svm_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
In [383]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner SVM: ')
print(f"Best Blended Base Learner SVM Hyperparameters: {blended_baselearner_svm_grid_search.best_params_}")
Best Blended Base Learner SVM: 
Best Blended Base Learner SVM Hyperparameters: {'blended_baselearner_svm_model__C': 1.0, 'blended_baselearner_svm_model__gamma': 'scale', 'blended_baselearner_svm_model__kernel': 'linear'}
In [384]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_svm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_svm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8827
F1 Score on Training Data: 0.9008

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.99      0.92      0.95       143
         1.0       0.84      0.97      0.90        61

    accuracy                           0.94       204
   macro avg       0.91      0.95      0.93       204
weighted avg       0.94      0.94      0.94       204

In [385]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [386]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_svm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [387]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_svm_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner SVM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [388]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_svm_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_svm_optimal.pkl"))
Out[388]:
['..\\models\\blended_model_baselearner_svm_optimal.pkl']

1.10.3 Base Learner - Ridge Classifier ¶

In [389]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [390]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_rc_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_rc_model', RidgeClassifier(class_weight='balanced',
                                                     random_state=88888888))
])
In [391]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_rc_hyperparameter_grid = {
    'blended_baselearner_rc_model__alpha': [1.00, 2.00],
    'blended_baselearner_rc_model__solver': ['sag', 'saga'],
    'blended_baselearner_rc_model__tol': [1e-3, 1e-4]
}
In [392]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [393]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_rc_grid_search = GridSearchCV(
    estimator=blended_baselearner_rc_pipeline,
    param_grid=blended_baselearner_rc_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [394]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [395]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_rc_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[395]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_rc_model__alpha': [1.0, 2.0],
                         'blended_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'blended_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_rc_model',
                                        RidgeClassifier(class_weight='balanced',
                                                        random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_rc_model__alpha': [1.0, 2.0],
                         'blended_baselearner_rc_model__solver': ['sag',
                                                                  'saga'],
                         'blended_baselearner_rc_model__tol': [0.001, 0.0001]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('blended_baselearner_rc_model',
                 RidgeClassifier(alpha=2.0, class_weight='balanced',
                                 random_state=88888888, solver='sag',
                                 tol=0.001))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
RidgeClassifier(alpha=2.0, class_weight='balanced', random_state=88888888,
                solver='sag', tol=0.001)
In [396]:
##################################
# Identifying the best model
##################################
blended_baselearner_rc_optimal = blended_baselearner_rc_grid_search.best_estimator_
In [397]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_rc_optimal_f1_cv = blended_baselearner_rc_grid_search.best_score_
blended_baselearner_rc_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
blended_baselearner_rc_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
In [398]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Ridge Classifier: ')
print(f"Best Blended Base Learner Ridge Classifier Hyperparameters: {blended_baselearner_rc_grid_search.best_params_}")
Best Blended Base Learner Ridge Classifier: 
Best Blended Base Learner Ridge Classifier Hyperparameters: {'blended_baselearner_rc_model__alpha': 2.0, 'blended_baselearner_rc_model__solver': 'sag', 'blended_baselearner_rc_model__tol': 0.001}
In [399]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_rc_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_rc_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8655
F1 Score on Training Data: 0.8819

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.96      0.93      0.95       143
         1.0       0.85      0.92      0.88        61

    accuracy                           0.93       204
   macro avg       0.91      0.92      0.91       204
weighted avg       0.93      0.93      0.93       204

In [400]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [401]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_rc_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [402]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_rc_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Ridge Classifier Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [403]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_rc_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_ridge_classifier_optimal.pkl"))
Out[403]:
['..\\models\\blended_model_baselearner_ridge_classifier_optimal.pkl']

1.10.4 Base Learner - Neural Network ¶

In [404]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [405]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_nn_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_nn_model', MLPClassifier(max_iter=500,
                                                   solver='lbfgs',
                                                   early_stopping=False,
                                                   random_state=88888888))
])
In [406]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_nn_hyperparameter_grid = {
    'blended_baselearner_nn_model__hidden_layer_sizes': [(50,), (100,)],
    'blended_baselearner_nn_model__activation': ['relu', 'tanh'],
    'blended_baselearner_nn_model__alpha': [0.0001, 0.001]
}
In [407]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [408]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_nn_grid_search = GridSearchCV(
    estimator=blended_baselearner_nn_pipeline,
    param_grid=blended_baselearner_nn_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [409]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [410]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_nn_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[410]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=88888888,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'blended_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'blended_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_nn_model',
                                        MLPClassifier(max_iter=500,
                                                      random_state=88888888,
                                                      solver='lbfgs'))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_nn_model__activation': ['relu',
                                                                      'tanh'],
                         'blended_baselearner_nn_model__alpha': [0.0001, 0.001],
                         'blended_baselearner_nn_model__hidden_layer_sizes': [(50,),
                                                                              (100,)]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('blended_baselearner_nn_model',
                 MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                               random_state=88888888, solver='lbfgs'))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=88888888,
              solver='lbfgs')
In [411]:
##################################
# Identifying the best model
##################################
blended_baselearner_nn_optimal = blended_baselearner_nn_grid_search.best_estimator_
In [412]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_nn_optimal_f1_cv = blended_baselearner_nn_grid_search.best_score_
blended_baselearner_nn_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
blended_baselearner_nn_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
In [413]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Neural Network: ')
print(f"Best Blended Base Learner Neural Network Hyperparameters: {blended_baselearner_nn_grid_search.best_params_}")
Best Blended Base Learner Neural Network: 
Best Blended Base Learner Neural Network Hyperparameters: {'blended_baselearner_nn_model__activation': 'relu', 'blended_baselearner_nn_model__alpha': 0.0001, 'blended_baselearner_nn_model__hidden_layer_sizes': (50,)}
In [414]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_nn_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_nn_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8674
F1 Score on Training Data: 0.9180

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       143
         1.0       0.92      0.92      0.92        61

    accuracy                           0.95       204
   macro avg       0.94      0.94      0.94       204
weighted avg       0.95      0.95      0.95       204

In [415]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [416]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_nn_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.7692

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.90      0.92      0.91        49
         1.0       0.79      0.75      0.77        20

    accuracy                           0.87        69
   macro avg       0.84      0.83      0.84        69
weighted avg       0.87      0.87      0.87        69

In [417]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_nn_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Neural Network Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [418]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_nn_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_neural_network_optimal.pkl"))
Out[418]:
['..\\models\\blended_model_baselearner_neural_network_optimal.pkl']

1.10.5 Base Learner - Decision Tree ¶

In [419]:
##################################
# Defining the categorical preprocessing parameters
##################################
categorical_features = ['Gender','Smoking','Physical_Examination','Adenopathy','Focality','Risk','T','Stage','Response']
categorical_transformer = OrdinalEncoder()
categorical_preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)],
                                             remainder='passthrough',
                                             force_int_remainder_cols=False)
In [420]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
blended_baselearner_dt_pipeline = Pipeline([
    ('categorical_preprocessor', categorical_preprocessor),
    ('blended_baselearner_dt_model', DecisionTreeClassifier(class_weight='balanced',
                                                            random_state=88888888))
])
In [421]:
##################################
# Defining hyperparameter grid
##################################
blended_baselearner_dt_hyperparameter_grid = {
    'blended_baselearner_dt_model__criterion': ['gini', 'entropy'],
    'blended_baselearner_dt_model__max_depth': [3, 5],
    'blended_baselearner_dt_model__min_samples_leaf': [5, 10]
}
In [422]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=88888888)
In [423]:
##################################
# Performing Grid Search with cross-validation
##################################
blended_baselearner_dt_grid_search = GridSearchCV(
    estimator=blended_baselearner_dt_pipeline,
    param_grid=blended_baselearner_dt_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [424]:
##################################
# Encoding the response variables
# for model evaluation
##################################
y_encoder = OrdinalEncoder()
y_encoder.fit(y_preprocessed_train.values.reshape(-1, 1))
y_preprocessed_train_encoded = y_encoder.transform(y_preprocessed_train.values.reshape(-1, 1)).ravel()
y_preprocessed_validation_encoded = y_encoder.transform(y_preprocessed_validation.values.reshape(-1, 1)).ravel()
In [425]:
##################################
# Fitting GridSearchCV
##################################
blended_baselearner_dt_grid_search.fit(X_preprocessed_train, y_preprocessed_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[425]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'blended_baselearner_dt_model__max_depth': [3, 5],
                         'blended_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=88888888),
             estimator=Pipeline(steps=[('categorical_preprocessor',
                                        ColumnTransformer(force_int_remainder_cols=False,
                                                          remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OrdinalEncoder(),
                                                                         ['Gender',
                                                                          'Smoking',
                                                                          'Physical_Examination',
                                                                          'Adenopathy',
                                                                          'Focality',
                                                                          'Risk',
                                                                          'T',
                                                                          'Stage',
                                                                          'Response'])])),
                                       ('blended_baselearner_dt_model',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=88888888))]),
             n_jobs=-1,
             param_grid={'blended_baselearner_dt_model__criterion': ['gini',
                                                                     'entropy'],
                         'blended_baselearner_dt_model__max_depth': [3, 5],
                         'blended_baselearner_dt_model__min_samples_leaf': [5,
                                                                            10]},
             scoring='f1', verbose=1)
Pipeline(steps=[('categorical_preprocessor',
                 ColumnTransformer(force_int_remainder_cols=False,
                                   remainder='passthrough',
                                   transformers=[('cat', OrdinalEncoder(),
                                                  ['Gender', 'Smoking',
                                                   'Physical_Examination',
                                                   'Adenopathy', 'Focality',
                                                   'Risk', 'T', 'Stage',
                                                   'Response'])])),
                ('blended_baselearner_dt_model',
                 DecisionTreeClassifier(class_weight='balanced',
                                        criterion='entropy', max_depth=3,
                                        min_samples_leaf=5,
                                        random_state=88888888))])
ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('cat', OrdinalEncoder(),
                                 ['Gender', 'Smoking', 'Physical_Examination',
                                  'Adenopathy', 'Focality', 'Risk', 'T',
                                  'Stage', 'Response'])])
['Gender', 'Smoking', 'Physical_Examination', 'Adenopathy', 'Focality', 'Risk', 'T', 'Stage', 'Response']
OrdinalEncoder()
['Age']
passthrough
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, min_samples_leaf=5, random_state=88888888)
In [426]:
##################################
# Identifying the best model
##################################
blended_baselearner_dt_optimal = blended_baselearner_dt_grid_search.best_estimator_
In [427]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
blended_baselearner_dt_optimal_f1_cv = blended_baselearner_dt_grid_search.best_score_
blended_baselearner_dt_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
blended_baselearner_dt_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
In [428]:
##################################
# Identifying the optimal model
##################################
print('Best Blended Base Learner Decision Trees: ')
print(f"Best Blended Base Learner Decision Trees Hyperparameters: {blended_baselearner_dt_grid_search.best_params_}")
Best Blended Base Learner Decision Trees: 
Best Blended Base Learner Decision Trees Hyperparameters: {'blended_baselearner_dt_model__criterion': 'entropy', 'blended_baselearner_dt_model__max_depth': 3, 'blended_baselearner_dt_model__min_samples_leaf': 5}
In [429]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {blended_baselearner_dt_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {blended_baselearner_dt_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train)))
F1 Score on Cross-Validated Data: 0.8767
F1 Score on Training Data: 0.8788

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.98      0.91      0.94       143
         1.0       0.82      0.95      0.88        61

    accuracy                           0.92       204
   macro avg       0.90      0.93      0.91       204
weighted avg       0.93      0.92      0.92       204

In [430]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [431]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {blended_baselearner_dt_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation)))
F1 Score on Validation Data: 0.8500

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        49
         1.0       0.85      0.85      0.85        20

    accuracy                           0.91        69
   macro avg       0.89      0.89      0.89        69
weighted avg       0.91      0.91      0.91        69

In [432]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_baselearner_dt_optimal.predict(X_preprocessed_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Base Learner Decision Trees Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [433]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(blended_baselearner_dt_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_baselearner_decision_trees_optimal.pkl"))
Out[433]:
['..\\models\\blended_model_baselearner_decision_trees_optimal.pkl']

1.10.6 Meta Learner - Logistic Regression ¶

In [434]:
##################################
# Defining the blending strategy (75-25 development-holdout split)
##################################
X_preprocessed_train_development, X_preprocessed_holdout, y_preprocessed_train_development, y_preprocessed_holdout = train_test_split(
    X_preprocessed_train, y_preprocessed_train_encoded, 
    test_size=0.25, 
    random_state=88888888
)
In [435]:
##################################
# Loading the pre-trained base learners
# from the previously saved pickle files
##################################
blended_baselearners = {}
blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
for name in blended_baselearner_model:
    blended_baselearner_model_path = os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl")
    blended_baselearners[name] = joblib.load(blended_baselearner_model_path)
    
In [436]:
##################################
# Initializing the meta-feature matrices
##################################
meta_train_blended = np.zeros((X_preprocessed_holdout.shape[0], len(blended_baselearners)))
meta_validation_blended = np.zeros((X_preprocessed_validation.shape[0], len(blended_baselearners)))
In [437]:
##################################
# Generating hold-out predictions for training the meta learner
##################################
for i, (name, model) in enumerate(blended_baselearners.items()):
    model.fit(X_preprocessed_train_development, y_preprocessed_train_development)  
    meta_train_blended[:, i] = model.predict_proba(X_preprocessed_holdout)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_holdout)
    meta_validation_blended[:, i] = model.predict_proba(X_preprocessed_validation)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_validation)
In [438]:
##################################
# Training the meta learner on the stacked features
##################################
blended_metalearner_lr_optimal = LogisticRegression(class_weight='balanced', 
                                                    penalty='l2', 
                                                    C=1.0, 
                                                    solver='lbfgs', 
                                                    random_state=88888888)
blended_metalearner_lr_optimal.fit(meta_train_blended, y_preprocessed_holdout)
Out[438]:
LogisticRegression(class_weight='balanced', random_state=88888888)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(class_weight='balanced', random_state=88888888)
In [439]:
##################################
# Saving the meta learner model
# developed from the meta-train data
################################## 
joblib.dump(blended_metalearner_lr_optimal, 
            os.path.join("..", MODELS_PATH, "blended_model_metalearner_logistic_regression_optimal.pkl"))
Out[439]:
['..\\models\\blended_model_metalearner_logistic_regression_optimal.pkl']
In [440]:
##################################
# Creating a function to extract the 
# meta-feature matrices for new data
################################## 
def extract_blended_metafeature_matrix(X_preprocessed_new):
    ##################################
    # Loading the pre-trained base learners
    # from the previously saved pickle files
    ##################################
    blended_baselearners = {}
    blended_baselearner_model = ['knn', 'svm', 'ridge_classifier', 'neural_network', 'decision_trees']
    for name in blended_baselearner_model:
        blended_baselearner_model_path = (os.path.join("..", MODELS_PATH, f"blended_model_baselearner_{name}_optimal.pkl"))
        blended_baselearners[name] = joblib.load(blended_baselearner_model_path)

    ##################################
    # Generating meta-features for new data
    ##################################
    meta_train_blended = np.zeros((X_preprocessed_holdout.shape[0], len(blended_baselearners)))
    meta_new_blended = np.zeros((X_preprocessed_new.shape[0], len(blended_baselearners)))

    ##################################
    # Generating holdout predictions
    # from the base learners
    ##################################
    for i, (name, model) in enumerate(blended_baselearners.items()):
        model.fit(X_preprocessed_train_development, y_preprocessed_train_development) 
        meta_train_blended[:, i] = model.predict_proba(X_preprocessed_holdout)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_holdout)
        meta_new_blended[:, i] = model.predict_proba(X_preprocessed_new)[:, 1] if hasattr(model, "predict_proba") else model.predict(X_preprocessed_new)

    return meta_new_blended
In [441]:
##################################
# Evaluating the F1 scores
# on the training and validation data
##################################
blended_metalearner_lr_optimal_f1_train = f1_score(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
blended_metalearner_lr_optimal_f1_validation = f1_score(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
In [442]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training data
# to assess overfitting optimism
##################################
print(f"F1 Score on Training Data: {blended_metalearner_lr_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train))))
F1 Score on Training Data: 0.8976

Classification Report on Train Data:
               precision    recall  f1-score   support

         0.0       0.97      0.94      0.95       143
         1.0       0.86      0.93      0.90        61

    accuracy                           0.94       204
   macro avg       0.92      0.94      0.93       204
weighted avg       0.94      0.94      0.94       204

In [443]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)))
cm_normalized = confusion_matrix(y_preprocessed_train_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_train)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [444]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validationing Data: {blended_metalearner_lr_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation))))
F1 Score on Validationing Data: 0.8293

Classification Report on Validation Data:
               precision    recall  f1-score   support

         0.0       0.94      0.92      0.93        49
         1.0       0.81      0.85      0.83        20

    accuracy                           0.90        69
   macro avg       0.87      0.88      0.88        69
weighted avg       0.90      0.90      0.90        69

In [445]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)))
cm_normalized = confusion_matrix(y_preprocessed_validation_encoded, blended_metalearner_lr_optimal.predict(extract_blended_metafeature_matrix(X_preprocessed_validation)), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Blended Meta Learner Logistic Regression Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
No description has been provided for this image

1.11. Consolidated Summary¶

2. Summary ¶

3. References ¶

  • [Book] Ensemble Methods for Machine Learning by Gautam Kunapuli
  • [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
  • [Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
  • [Book] Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou
  • [Book] Effective XGBoost: Optimizing, Tuning, Understanding, and Deploying Classification Models (Treading on Python) by Matt Harrison, Edward Krueger, Alex Rook, Ronald Legere and Bojan Tunguz
  • [Python Library API] NumPy by NumPy Team
  • [Python Library API] pandas by Pandas Team
  • [Python Library API] seaborn by Seaborn Team
  • [Python Library API] matplotlib.pyplot by MatPlotLib Team
  • [Python Library API] matplotlib.image by MatPlotLib Team
  • [Python Library API] matplotlib.offsetbox by MatPlotLib Team
  • [Python Library API] itertools by Python Team
  • [Python Library API] operator by Python Team
  • [Python Library API] sklearn.experimental by Scikit-Learn Team
  • [Python Library API] sklearn.impute by Scikit-Learn Team
  • [Python Library API] sklearn.linear_model by Scikit-Learn Team
  • [Python Library API] sklearn.preprocessing by Scikit-Learn Team
  • [Python Library API] scipy by SciPy Team
  • [Python Library API] sklearn.tree by Scikit-Learn Team
  • [Python Library API] sklearn.ensemble by Scikit-Learn Team
  • [Python Library API] sklearn.svm by Scikit-Learn Team
  • [Python Library API] sklearn.metrics by Scikit-Learn Team
  • [Python Library API] sklearn.model_selection by Scikit-Learn Team
  • [Python Library API] imblearn.over_sampling by Imbalanced-Learn Team
  • [Python Library API] imblearn.under_sampling by Imbalanced-Learn Team
  • [Python Library API] StatsModels by StatsModels Team
  • [Python Library API] SciPy by SciPy Team
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Stacking Machine Learning: Everything You Need to Know by Ada Parker (MachineLearningPro.Org)
  • [Article] Ensemble Learning: Bagging, Boosting and Stacking by Edouard Duchesnay, Tommy Lofstedt and Feki Younes (Duchesnay.GitHub.IO)
  • [Article] Stack Machine Learning Models: Get Better Results by Casper Hansen (Developer.IBM.Com)
  • [Article] GradientBoosting vs AdaBoost vs XGBoost vs CatBoost vs LightGBM by Geeks for Geeks Team (GeeksForGeeks.Org)
  • [Article] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] The Ultimate Guide to AdaBoost Algorithm | What is AdaBoost Algorithm? by Ashish Kumar (MyGreatLearning.Com)
  • [Article] A Gentle Introduction to Ensemble Learning Algorithms by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results by Necati Demir (Toptal.Com)
  • [Article] The Essential Guide to Ensemble Learning by Rohit Kundu (V7Labs.Com)
  • [Article] Develop an Intuition for How Ensemble Learning Works by by Jason Brownlee (Machine Learning Mastery)
  • [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
  • [Article] Ensemble Learning: Bagging, Boosting, Stacking by Ayşe Kübra Kuyucu (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Aleyna Şenozan (Medium)
  • [Article] Boosting, Stacking, and Bagging for Ensemble Models for Time Series Analysis with Python by Kyle Jones (Medium)
  • [Article] Different types of Ensemble Techniques — Bagging, Boosting, Stacking, Voting, Blending by Abhishek Jain (Medium)
  • [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
  • [Article] Understanding Ensemble Methods: Bagging, Boosting, and Stacking by Divya bhagat (Medium)
  • [Video Tutorial] BAGGING vs. BOOSTING vs STACKING in Ensemble Learning | Machine Learning by Gate Smashers (YouTube)
  • [Video Tutorial] What is Ensemble Method in Machine Learning | Bagging | Boosting | Stacking | Voting by Data_SPILL (YouTube)
  • [Video Tutorial] Ensemble Methods | Bagging | Boosting | Stacking by World of Signet (YouTube)
  • [Video Tutorial] Ensemble (Boosting, Bagging, and Stacking) in Machine Learning: Easy Explanation for Data Scientists by Emma Ding (YouTube)
  • [Video Tutorial] Ensemble Learning - Bagging, Boosting, and Stacking explained in 4 minutes! by Melissa Van Bussel (YouTube)
  • [Video Tutorial] Introduction to Ensemble Learning | Bagging , Boosting & Stacking Techniques by UncomplicatingTech (YouTube)
  • [Video Tutorial] Machine Learning Basics: Ensemble Learning: Bagging, Boosting, Stacking by ISSAI_NU (YouTube)
  • [Course] DataCamp Python Data Analyst Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Associate Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Engineer Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)
In [446]:
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))